Lung cancer signature

ABSTRACT

The present disclosure relates to compositions and methods for cancer diagnosis, research and therapy, including but not limited to, cancer markers. In particular, the present disclosure relates to cancer markers as diagnostic markers and clinical targets for lung cancer.

The present application claims priority to U.S. Provisional Patent Application Ser. No. 61/904,711, filed Nov. 15, 2013, the disclosure of which is herein incorporated by reference in its entirety.

FIELD OF THE INVENTION

The present disclosure relates to compositions and methods for cancer diagnosis, research and therapy, including but not limited to, cancer markers. In particular, the present disclosure relates to cancer markers as diagnostic markers and clinical targets for lung cancer.

BACKGROUND OF THE INVENTION

Lung cancer remains the leading cause of cancer death in industrialized countries. About 75 percent of lung cancer cases are categorized as non-small cell lung cancer (e.g., adenocarcinomas), and the other 25 percent are small cell lung cancer. Lung cancers are categorized into several stages, based on the spread of the disease. In stage I cancer, the tumor is only in the lung and surrounded by normal tissue. In stage II cancer, cancer has spread to nearby lymph nodes. In stage III, cancer has spread to the chest wall or diaphragm near the lung, or to the lymph nodes in the mediastinum (the area that separates the two lungs), or to the lymph nodes on the other side of the chest or in the neck. This stage is divided into stage IIIA, which can usually be operated on, and stage IIIB, which usually cannot be operated on. In stage IV, the cancer has spread to other parts of the body.

Most patients with non-small cell lung cancer (NSCLC) present with advanced stage disease, and despite recent advances in multi-modality therapy, the overall ten-year survival rate remains dismal at 8-10% (Fry et al., Cancer 86:1867 [1999]). However, a significant minority of patients, approximately 25-30%, with NSCLC have pathological stage I disease and are usually treated with surgery alone. While it is known that 35-50% of patients with stage I disease will relapse within five years (Williams et al., Thorac. Cardiovasc. Surg. 82:70 [1981]; Pairolero et al., Ann. Thorac. Surg. 38:331 [1984]), it is not currently possible to identify which specific patients are at high risk of relapse.

Adenocarcinoma is currently the predominant histologic subtype of NSCLC (Fry et al., supra; Kaisermann et al., Brazil Oncol. Rep. 8:189 [2001]; Roggli et al., Hum. Pathol. 16:569 [1985]). While histopathological assessment of primary lung carcinomas can roughly stratify patients, there is still an urgent need to identify those patients who are at high risk for recurrent or metastatic disease by other means. Previous studies have identified a number of preoperative variables that impact survival of patients with NSCLC (Gail et al., Cancer 54:1802 [1984]; Takise et al., Cancer 61:2083 [1988]; Ichinose et al., J. Thorac. Cardiovasc. Surg. 106:90 [1993]; Harpole et al., Cancer Res. 55:1995]). Tumor size, vascular invasion, poor differentiation, high tumor proliferative index, and several genetic alterations, including K-ras (Rodenhuis et al., N. Engl. J. Med. 317:929 [1987]; Slebos et al., N. Engl. J. Med. 323:561 [1990]) and p53 (Harpole et al., supra; Horio et al., Cancer Res. 53:1 [1993]) mutation, have been reported as prognostic indicators.

Tumor stage is an important predictor of patient survival; however, much variability in outcome is not accounted for by stage alone, as is observed for stage I lung adenocarcinoma, which has a 65-70% five-year survival (Williams et al., supra; Pairolero et al., supra). Current therapy for patients with stage I disease usually consists of surgical resection and no additional treatment (Williams et al., supra; Pairolero et al., supra). The identification of a high-risk group among patients with stage I disease would lead to consideration of additional therapeutic intervention for this group, as well as to improved survival of these patients.

There is a need for additional diagnostic and treatment options, particularly treatments customized to a patient's tumor.

SUMMARY OF THE INVENTION

The present disclosure relates to compositions and methods for cancer diagnosis, research and therapy, including but not limited to, cancer markers. In particular, the present disclosure relates to cancer markers and panels of cancer markers as diagnostic markers and clinical targets for lung cancer. In some embodiments, the present invention provides compositions, kits, systems and methods for determining the likelihood of survival of a subject based on altered expression of one or more cancer markers.

For example, in some embodiments, the present invention provides a kit for characterizing cancer (e.g., determining likelihood of survival, likelihood of metastasis (e.g., lymph node metastasis), or likelihood of advancement) in a subject diagnosed with lung cancer, comprising: reagents for detection of altered expression of one or more (e.g., 2 or more, 3 or more, 4 or more, 5 or more, 10 or more, 15 or more or all of) HMGA2, LAMC2, EIF4EBP1, FN1, HSPB1, ULBP2, VEGFA, BMP1, FST, CD59, TNFRSF12A, PDGFB, COL4A3, SERPINE1, SCG2, TGFB1, HMOX1, STC1, XYLT1, or PPP1R14B. In some embodiments, the kit further comprises reagents for detecting one or more of ADAM19, ANGPTL4, AP1S2, ARPC4, BPGM, CD151, CHST11, CHST3, COL1A1, COL4A1, COL4A2, COL7A1, CRIP2, VCAN, CTGF, CXCL12, CYR61, DSC2, ECM1, EFNA1, EPHB2, FHL2, FSTL1, FSTL3, C11orf41, GALNT2, GSN, IGF1, IGF2, IGFBP5, IGFBP7, IL11, INHBA, ITGA2, ITGA3, JAG1, MGC17330, KIAA1797, LEFTY2, LIF, LTBP1, LTBP2, LTBP3, LTBP4, PIK3IP1, MMP1, MMP10, MMP2, MMP9, MRC2, NPC2, NPTX1, NRG1, PAWR, PCDH1, PDLIM2, PGRMC2, PLAT, PLAUR, PLOD2, PLSCR3, HTRA1, PTPRK, RSU1, SEMA3C, SERPINE2, SPOCK1, TAGLN, TAGLN2, TAX1BP3, TGFBR1, THBS1, TIMP2, TLL2, TNFAIP6, TP53I3, or TUBA4B. In some embodiments, markers are detected in a multiplex or panel format comprising 5 or more, 10 or more, 15 or more, or all of the aforementioned markers.

The present invention also provides a composition comprising one or more reaction mixtures (e.g., at least 5, at least 10, at least 15, or corresponding to all of the genes), each reaction mixture comprising a complex of a reagent for detection of one or more genes selected from, for example, HMGA2, LAMC2, EIF4EBP1, FN1, HSPB1, ULBP2, VEGFA, BMP1, FST, CD59, TNFRSF12A, PDGFB, COL4A3, SERPINE1, SCG2, TGFB1, HMOX1, STC1, XYLT1, PPP1R14B, ADAM19, ANGPTL4, AP1S2, ARPC4, BPGM, CD151, CHST11, CHST3, COL1A1, COL4A1, COL4A2, COL7A1, CRIP2, VCAN, CTGF, CXCL12, CYR61, DSC2, ECM1, EFNA1, EPHB2, FHL2, FSTL1, FSTL3, C11orf41, GALNT2, GSN, IGF1, IGF2, IGFBP5, IGFBP7, IL11, INHBA, ITGA2, ITGA3, JAG1, MGC17330, KIAA1797, LEFTY2, LIF, LTBP1, LTBP2, LTBP3, LTBP4, PIK3IP1, MMP1, MMP10, MMP2, MMP9, MRC2, NPC2, NPTX1, NRG1, PAWR, PCDH1, PDLIM2, PGRMC2, PLAT, PLAUR, PLOD2, PLSCR3, HTRA1, PTPRK, RSU1, SEMA3C, SERPINE2, SPOCK1, TAGLN, TAGLN2, TAX1BP3, TGFBR1, THBS1, TIMP2, TLL2, TNFAIP6, TP53I3, or TUBA4B bound to the gene. In some embodiments, the reagents are, for example, nucleic acid probes that bind to a nucleic acid encoding the gene, a pair of amplification primers that bind to a nucleic acid encoding the gene, a sequencing primer that binds to a nucleic acid encoding the gene, or an antibody that binds to a polypeptide encoded by the gene.

In other embodiments, the present invention provides methods for characterizing cancer (e.g., determining likelihood of survival, likelihood of metastasis (e.g., lymph node metastasis), or likelihood of advancement) in a subject diagnosed with lung cancer, comprising: contacting a sample from a subject diagnosed with lung cancer with reagents for detection of altered expression of one or more (e.g., 2 or more, 3 or more, 4 or more, 5 or more, 10 or more, 15 or more or all of) HMGA2, LAMC2, EIF4EBP1, FN1, HSPB1, ULBP2, VEGFA, BMP1, FST, CD59, TNFRSF12A, PDGFB, COL4A3, SERPINE1, SCG2, TGFB1, HMOX1, STC1, XYLT1, or PPP1R14B. In some embodiments, the method further comprises detecting one or more of ADAM19, ANGPTL4, AP1S2, ARPC4, BPGM, CD151, CHST11, CHST3, COL1A1, COL4A1, COL4A2, COL7A1, CRIP2, VCAN, CTGF, CXCL12, CYR61, DSC2, ECM1, EFNA1, EPHB2, FHL2, FSTL1, FSTL3, C11orf41, GALNT2, GSN, IGF1, IGF2, IGFBP5, IGFBP7, IL11, INHBA, ITGA2, ITGA3, JAG1, MGC17330, KIAA1797, LEFTY2, LIF, LTBP1, LTBP2, LTBP3, LTBP4, PIK3IP1, MMP1, MMP10, MMP2, MMP9, MRC2, NPC2, NPTX1, NRG1, PAWR, PCDH1, PDLIM2, PGRMC2, PLAT, PLAUR, PLOD2, PLSCR3, HTRA1, PTPRK, RSU1, SEMA3C, SERPINE2, SPOCK1, TAGLN, TAGLN2, TAX1BP3, TGFBR1, THBS1, TIMP2, TLL2, TNFAIP6, TP53I3, or TUBA4B. In some embodiments, the lung cancer is lung adenocarcinoma or squamous cell carcinoma. In some embodiments, the lung cancer is early stage lung cancer or advanced lung cancer.

In some embodiments, the present invention provides a method comprising a) characterizing cancer in a subject using any of the aforementioned kits and methods; b) determining a treatment course of action based on said characterizing (e.g., choice of chemotherapeutic agent, surgery, or radiation treatment); and optionally c) administering said treatment. In some embodiments, the method further comprises the step of repeating said characterizing step after said treatment step. In some embodiments, the results of the characterizing are used to alter, stop, start, or modify the treatment course of action.

Further embodiments provide the use of any of the aforementioned compositions and kits in characterizing cancer in a subject diagnosed with lung cancer.

Additional embodiments of the present disclosure are provided in the description and examples below.

DESCRIPTION OF THE FIGURES

FIG. 1A-B shows (A) analysis of EMT secretome for enriched biological processes and cellular components. (B) cluster analysis of EASP signature and other cancer related pathways in 442 lung adenocarcinomas.

FIG. 2A-B shows (A) boxplot of EASP signature expression with differentiation, stage and lymph nodal status. (B) testing of 97-gene EASP prognosis signature as a predictor of lung cancer patient's survival.

FIG. 3 shows testing of 20-gene rEASP prognosis signature as a predictor of lung cancer patient's survival.

FIG. 4 shows variable importance scores (VIMP) of top 20 genes used in Random survival forest (RSF) in Shedden 442 training set.

DEFINITIONS

Unless defined otherwise, all terms of art, notations and other scientific terms or terminology used herein have the same meaning as is commonly understood by one of ordinary skill in the art to which this disclosure belongs. Many of the techniques and procedures described or referenced herein are well understood and commonly employed using conventional methodology by those skilled in the art. As appropriate, procedures involving the use of commercially available kits and reagents are generally carried out in accordance with manufacturer defined protocols and/or parameters unless otherwise noted. All patents, applications, published applications and other publications referred to herein are incorporated by reference in their entirety. If a definition set forth in this section is contrary to or otherwise inconsistent with a definition set forth in the patents, applications, published applications, and other publications that are herein incorporated by reference, the definition set forth in this section prevails over the definition that is incorporated herein by reference.

As used herein, “a” or “an” means “at least one” or “one or more.”

As used herein, the term “gene upregulated in cancer” refers to a gene that is expressed (e.g., mRNA or protein expression) at a higher level in cancer (e.g., lung cancer) relative to the level in other tissue. In this context, “other tissue” may refer to, for example, tissues from different organs in the same subject or to normal tissues of the same or different type. In some embodiments, genes upregulated in cancer are expressed at a level at least 10% to 300% higher than the level of expression in other tissue. For example, genes upregulated in cancer are frequently expressed at a level preferably at least 25%, at least 50%, at least 100%, at least 200%, or at least 300% higher than the level of expression in other tissue.
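The percentage thresholds in this definition can be illustrated with a short calculation. The following is a minimal sketch only: the function names, the example expression values, and the default 25% cutoff are assumptions chosen for illustration (the cutoff is one of the example values listed above), and the disclosure does not prescribe any particular implementation.

```python
def percent_increase(cancer_level: float, reference_level: float) -> float:
    """Percent by which expression in cancer exceeds expression in other tissue."""
    if reference_level <= 0:
        raise ValueError("reference_level must be positive")
    return (cancer_level - reference_level) / reference_level * 100.0


def is_upregulated(cancer_level: float, reference_level: float,
                   min_percent: float = 25.0) -> bool:
    """True if expression is at least min_percent higher than the reference.

    min_percent is a hypothetical cutoff; the definition lists 10%, 25%, 50%,
    100%, 200% and 300% as example thresholds.
    """
    return percent_increase(cancer_level, reference_level) >= min_percent


if __name__ == "__main__":
    # Illustrative (made-up) expression values in arbitrary units.
    print(percent_increase(15.0, 10.0))
    print(is_upregulated(15.0, 10.0))
    print(is_upregulated(11.0, 10.0))
```

Expressing the comparison as a percent increase over the reference keeps the cutoff directly comparable to the "at least N% higher" language of the definition.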

As used herein, the term “gene upregulated in lung tissue” or “gene downregulated in lung cancer” refers to a gene that is expressed (e.g., mRNA or protein expression) at a higher or lower level in tissue obtained from lung (e.g., lung cancer tissue or cell) relative to the level in other tissue (e.g., non-cancerous lung tissue or non-lung tissue). In some embodiments, genes upregulated in lung tissue are expressed at a level at least 10% to 300% higher than the level of expression in other tissues. For example, genes upregulated in cancer are frequently expressed at a level preferably at least 25%, at least 50%, at least 100%, at least 200%, or at least 300% higher than the level of expression in other tissues. In some embodiments, genes upregulated in lung tissue are exclusively expressed in lung tissue.

As used herein, the terms “detect”, “detecting” or “detection” may describe either the general act of discovering or discerning or the specific observation of a detectably labeled composition.

As used herein, the term “stage of cancer” refers to a qualitative or quantitative assessment of the level of advancement of a cancer. Criteria used to determine the stage of a cancer include, but are not limited to, the size of the tumor and the extent of metastases (e.g., localized or distant).

As used herein, the term “nucleic acid molecule” refers to any nucleic acid containing molecule, including but not limited to, DNA or RNA. The term encompasses sequences that include any of the known base analogs of DNA and RNA including, but not limited to, 4-acetylcytosine, 8-hydroxy-N6-methyladenosine, aziridinylcytosine, pseudoisocytosine, 5-(carboxyhydroxylmethyl)uracil, 5-fluorouracil, 5-bromouracil, 5-carboxymethylaminomethyl-2-thiouracil, 5-carboxymethyl-aminomethyluracil, dihydrouracil, inosine, N6-isopentenyladenine, 1-methyladenine, 1-methylpseudouracil, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-methyladenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5′-methoxycarbonylmethyluracil, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid, oxybutoxosine, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, N-uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid, pseudouracil, queosine, 2-thiocytosine, and 2,6-diaminopurine.

The term “gene” refers to a nucleic acid (e.g., DNA) sequence that comprises coding sequences necessary for the production of a polypeptide, precursor, or RNA (e.g., rRNA, tRNA). The polypeptide can be encoded by a full length coding sequence or by any portion of the coding sequence so long as the desired activity or functional properties (e.g., enzymatic activity, ligand binding, signal transduction, immunogenicity, etc.) of the full-length or fragment are retained. The term also encompasses the coding region of a structural gene and the sequences located adjacent to the coding region on both the 5′ and 3′ ends for a distance of about 1 kb or more on either end such that the gene corresponds to the length of the full-length mRNA. Sequences located 5′ of the coding region and present on the mRNA are referred to as 5′ non-translated sequences. Sequences located 3′ or downstream of the coding region and present on the mRNA are referred to as 3′ non-translated sequences. The term “gene” encompasses both cDNA and genomic forms of a gene. A genomic form or clone of a gene contains the coding region interrupted with non-coding sequences termed “introns” or “intervening regions” or “intervening sequences.” Introns are segments of a gene that are transcribed into nuclear RNA (hnRNA); introns may contain regulatory elements such as enhancers. Introns are removed or “spliced out” from the nuclear or primary transcript; introns therefore are absent in the messenger RNA (mRNA) transcript. The mRNA functions during translation to specify the sequence or order of amino acids in a nascent polypeptide.

As used herein, the term “oligonucleotide” refers to a short length of single-stranded polynucleotide chain. Oligonucleotides are typically less than 200 residues long (e.g., between 15 and 100); however, as used herein, the term is also intended to encompass longer polynucleotide chains. Oligonucleotides are often referred to by their length. For example, a 24 residue oligonucleotide is referred to as a “24-mer”. Oligonucleotides can form secondary and tertiary structures by self-hybridizing or by hybridizing to other polynucleotides. Such structures can include, but are not limited to, duplexes, hairpins, cruciforms, bends, and triplexes.

As used herein, the term “probe” refers to an oligonucleotide, whether occurring naturally as in a purified restriction digest or produced synthetically, recombinantly or by PCR amplification, which is capable of hybridizing to at least a portion of another oligonucleotide of interest. A probe may be single-stranded or double-stranded. Probes are useful in the detection, identification and isolation of particular gene sequences. It is contemplated that any probe used in methods of the present disclosure will be labeled with any “reporter molecule,” so that it is detectable in any detection system, including, but not limited to, enzyme (e.g., ELISA, as well as enzyme-based histochemical assays), fluorescent, radioactive, and luminescent systems. It is not intended that the methods or reagents of the present disclosure be limited to any particular detection system or label.

The term “isolated” when used in relation to a nucleic acid, as in “an isolated oligonucleotide” or “isolated polynucleotide” refers to a nucleic acid sequence that is identified and separated from at least one component or contaminant with which it is ordinarily associated in its natural source. An isolated nucleic acid is present in a form or setting that is different from that in which it is found in nature. In contrast, non-isolated nucleic acids are found in the state they exist in nature. For example, a given DNA sequence (e.g., a gene) is found on the host cell chromosome in proximity to neighboring genes; RNA sequences, such as a specific mRNA sequence encoding a specific protein, are found in the cell as a mixture with numerous other mRNAs that encode a multitude of proteins. However, isolated nucleic acid encoding a given protein includes, by way of example, such nucleic acid in cells ordinarily expressing the given protein where the nucleic acid is in a chromosomal location different from that of natural cells, or is otherwise flanked by a different nucleic acid sequence than that found in nature. The isolated nucleic acid, oligonucleotide, or polynucleotide may be present in single-stranded or double-stranded form. When an isolated nucleic acid, oligonucleotide or polynucleotide is to be utilized to express a protein, the nucleic acid, oligonucleotide or polynucleotide often will contain, at a minimum, the sense or coding strand (i.e., the oligonucleotide or polynucleotide may be single-stranded), but may contain both the sense and anti-sense strands (i.e., the oligonucleotide or polynucleotide may be double-stranded).

As used herein, the term “purified” or “to purify” refers to the removal of components (e.g., contaminants) from a sample. For example, antibodies are purified by removal of contaminating non-immunoglobulin proteins; they are also purified by the removal of immunoglobulin that does not bind to the target molecule. The removal of non-immunoglobulin proteins and/or the removal of immunoglobulins that do not bind to the target molecule results in an increase in the percent of target-reactive immunoglobulins in the sample. In another example, recombinant polypeptides are expressed in bacterial host cells and the polypeptides are purified by the removal of host cell proteins; the percent of recombinant polypeptides is thereby increased in the sample.

As used herein, the term “sample” is used in its broadest sense. In one sense, it is meant to include a specimen or culture obtained from any source, as well as biological and environmental samples. Biological samples may be obtained from animals (including humans) and encompass fluids, solids, tissues (e.g., lung tissue biopsy, lavage fluid, exhaled air, etc.), and gases. Biological samples include blood products, such as plasma, serum and the like. Such examples are not however to be construed as limiting the sample types applicable to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present disclosure relates to compositions and methods for cancer diagnosis, research and therapy, including but not limited to, cancer markers. In particular, the present disclosure relates to cancer markers as diagnostic markers and clinical targets for lung cancer.

Lung cancer is the leading cause of cancer related death around the world. The advances made in the last decade in diagnosis and treatment did not translate into significant improvement in overall 5-year survival rates. The tumor-node-metastasis (TNM) staging system combined with pathologic diagnosis has remained the major tool for medical decision making and predicting patient survival (Detterbeck et al., Chest, 2009. 136(1): p. 260-71; Arribalzaga et al., J Thorac Oncol, 2009. 4(10): p. 1301; author reply 1301-2.). However, accumulating evidence indicates that although patients with identical histology, differentiation, location and stage at diagnosis are treated by similar therapy, survival is highly heterogeneous, indicating that the current methods of tumor classification and staging are not sufficient for selecting the best treatment and determining prognosis. 30-55% of early stage patients who are treated primarily by surgery will have recurrence within 3 years. Recent randomized clinical trials revealed a significant survival advantage in patients receiving chemotherapy after complete resection in the stage IB-IIIA categories (Visbal et al., Chest, 2005. 128(4): p. 2933-43; Waller et al., Eur J Cardiothorac Surg, 2004. 26(1): p. 173-82; Domont et al., Semin Oncol, 2005. 32(3): p. 279-83; Azzoli, Nat Clin Pract Oncol, 2005. 2(11): p. 552-3). This trend indicates a need to explore alternative indicators to understand the underlying prognosis of a given patient, identify the early stage patients at the greatest risk of relapse and decide on an appropriate treatment strategy that would optimize patient survival.

Experiments conducted during the course of development of embodiments of the present disclosure developed a prognostic gene signature based on the cellular process of epithelial-mesenchymal transition (EMT), which plays a critical role in tumor progression. EMT is considered an initiating event for distant dissemination of tumor cells and confers many clinically relevant properties to cancer cells, including migratory and invasive capacity, resistance to apoptosis, drug resistance, evasion of host immune surveillance, and tumor stem cell traits. Cells undergoing EMT represent tumor cells with metastatic potential. Therefore, characterizing the secretome of cells in EMT identifies biomarkers that allow monitoring of EMT in tumor progression and provide a prognostic signature to predict recurrence and survival, particularly in early stage patients.

Utilizing a TGF-β-induced EMT model, differentially secreted proteins were profiled by GeLC-MS/MS and spectral counting, in the conditioned media of A549 lung adenocarcinoma cell line cultured in the presence and absence of TGF-β. By integrating the EMT secretome and the EMT gene expression data from our earlier study (GSE17708) (Slodkowska), a 97 gene EMT Associated Secretory Phenotype (EASP) signature that showed strong correlation to differentiation and stage and predicted survival of lung adenocarcinoma patients in training and independent test sets was identified. This set was further refined to a 20 gene signature (rEASP), which performed equally well in predicting survival in exclusively early stage (Stage I & II) adenocarcinomas as well as squamous cell carcinomas of lung. Meta-analysis on different lung cancer gene expression data sets clearly established the effectiveness of the 20 gene signature in stratifying the lung cancer patients into low, medium and high risk groups with distinct survival times.
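The stratification into low, medium and high risk groups can be pictured with a small sketch. This passage does not state how group boundaries were chosen, so the tertile split below is purely an assumption for illustration, as are the function name, the patient identifiers, and the risk score values; the scores stand in for whatever per-patient risk estimate a signature model produces.

```python
from typing import Dict


def stratify_by_tertile(risk_scores: Dict[str, float]) -> Dict[str, str]:
    """Assign each patient to 'low', 'medium' or 'high' by rank of risk score.

    Assumption: groups are equal-sized tertiles of the ranked scores. This is
    one simple convention, not a rule taken from the disclosure.
    """
    ranked = sorted(risk_scores, key=lambda patient: risk_scores[patient])
    n = len(ranked)
    groups: Dict[str, str] = {}
    for rank, patient in enumerate(ranked):
        if rank < n / 3:
            groups[patient] = "low"
        elif rank < 2 * n / 3:
            groups[patient] = "medium"
        else:
            groups[patient] = "high"
    return groups


if __name__ == "__main__":
    # Hypothetical risk scores for six hypothetical patients.
    scores = {"P1": 0.12, "P2": 0.85, "P3": 0.40, "P4": 0.55, "P5": 0.05, "P6": 0.91}
    for patient, group in sorted(stratify_by_tertile(scores).items()):
        print(patient, group)
```

Rank-based cut points keep the three groups balanced in size; fixed score thresholds derived from a training set would be an equally plausible alternative.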

Accordingly, in some embodiments, the present invention provides cancer markers and panels of cancer markers for research, screening and clinical (e.g., prediction of survival of patients with early stage lung cancer) applications.

I. Cancer Markers

In some embodiments, the present invention provides cancer markers whose altered expression (e.g., relative to the level of expression in a non-cancerous lung sample) is indicative of cancer (e.g., lung cancer). For example, in some embodiments, the cancer marker comprises one or more (e.g., 2 or more, 3 or more, 4 or more, 5 or more, 10 or more, 15 or more, or all of) HMGA2, LAMC2, EIF4EBP1, FN1, HSPB1, ULBP2, VEGFA, BMP1, FST, CD59, TNFRSF12A, PDGFB, COL4A3, SERPINE1, SCG2, TGFB1, HMOX1, STC1, XYLT1, or PPP1R14B. In some embodiments, one or more additional markers selected from ADAM19, ANGPTL4, AP1S2, ARPC4, BPGM, CD151, CHST11, CHST3, COL1A1, COL4A1, COL4A2, COL7A1, CRIP2, VCAN, CTGF, CXCL12, CYR61, DSC2, ECM1, EFNA1, EPHB2, FHL2, FSTL1, FSTL3, C11orf41, GALNT2, GSN, IGF1, IGF2, IGFBP5, IGFBP7, IL11, INHBA, ITGA2, ITGA3, JAG1, MGC17330, KIAA1797, LEFTY2, LIF, LTBP1, LTBP2, LTBP3, LTBP4, PIK3IP1, MMP1, MMP10, MMP2, MMP9, MRC2, NPC2, NPTX1, NRG1, PAWR, PCDH1, PDLIM2, PGRMC2, PLAT, PLAUR, PLOD2, PLSCR3, HTRA1, PTPRK, RSU1, SEMA3C, SERPINE2, SPOCK1, TAGLN, TAGLN2, TAX1BP3, TGFBR1, THBS1, TIMP2, TLL2, TNFAIP6, TP53I3, or TUBA4B are detected. Sequences of the genes can be found, for example, in the GenBank database (NCBI). In some embodiments, expression of the marker is increased or decreased relative to the level in a non-cancerous lung sample (e.g., 5%, 10%, 25%, 50%, 75%, 100% or more altered expression).

In some embodiments, genes for inclusion in the panel are selected based on their ability to characterize cancer (e.g., based on over or under expression of the marker). In some embodiments, statistical techniques (e.g., those described in the experimental section below) are utilized to determine the predictive value of genes or panels of genes. In some embodiments, panels are selected for their collective predictive value using any number of statistical techniques (e.g., those described herein).

In some embodiments, markers are detected in a multiplex or panel format comprising 5 or more, 10 or more, 25 or more, 50 or more or all of the aforementioned markers.

II. Antibodies

The cancer marker proteins of the present disclosure, including fragments, derivatives and analogs thereof, may be used as immunogens to produce antibodies having use in the diagnostic, screening, research, and therapeutic methods described herein. The antibodies may be polyclonal or monoclonal, chimeric, humanized, single chain, Fv or Fab fragments. Various procedures can be used for the production and labeling of such antibodies and fragments. See, e.g., Burns, ed., Immunochemical Protocols, 3rd ed., Humana Press (2005); Harlow and Lane, Antibodies: A Laboratory Manual, Cold Spring Harbor Laboratory (1988); Kozbor et al., Immunology Today 4: 72 (1983); Köhler and Milstein, Nature 256: 495 (1975).

III. Diagnostic and Screening Applications

Expression levels of the cancer markers may be detectable as DNA, RNA or protein. The present disclosure provides RNA and protein based diagnostic and screening methods that detect the expression levels of the cancer markers described herein. The present disclosure also provides compositions and kits for diagnostic and screening purposes.

A. Sample

Any sample suspected of containing the cancer markers may be tested according to the methods of the present disclosure. By way of non-limiting example, the sample may be tissue (e.g., a lung biopsy sample), lung related samples (e.g., lavage fluid, exhaled air, sputum, etc.), blood, cell secretions or a fraction thereof (e.g., plasma, serum, exosomes, etc.).

The patient sample typically undergoes preliminary processing designed to isolate or enrich the sample for the cancer marker(s) or cells that contain the cancer marker(s). A variety of techniques can be used for this purpose, including but not limited to: centrifugation; immunocapture; cell lysis; and nucleic acid target capture.

B. Detection of RNA

In some preferred embodiments, lung cancer markers (e.g., including but not limited to, those disclosed herein) are detected by measuring the expression of corresponding mRNA in a tissue sample (e.g., lung tissue). mRNA expression may be measured by any suitable method, including but not limited to, those disclosed below.

In some embodiments, RNA is detected by Northern blot analysis. Northern blot analysis involves the separation of RNA and hybridization of a complementary labeled probe. An exemplary method for Northern blot analysis is provided in Example 3.

In still further embodiments, RNA (or corresponding cDNA) is detected by hybridization to an oligonucleotide probe. A variety of hybridization assays using a variety of technologies for hybridization and detection are available. For example, in some embodiments, TaqMan assay (PE Biosystems, Foster City, Calif.; See e.g., U.S. Pat. Nos. 5,962,233 and 5,538,848, each of which is herein incorporated by reference) is utilized. The assay is performed during a PCR reaction. The TaqMan assay exploits the 5′-3′ exonuclease activity of the AMPLITAQ GOLD DNA polymerase. A probe consisting of an oligonucleotide with a 5′-reporter dye (e.g., a fluorescent dye) and a 3′-quencher dye is included in the PCR reaction. During PCR, if the probe is bound to its target, the 5′-3′ nucleolytic activity of the AMPLITAQ GOLD polymerase cleaves the probe between the reporter and the quencher dye. The separation of the reporter dye from the quencher dye results in an increase of fluorescence. The signal accumulates with each cycle of PCR and can be monitored with a fluorimeter.
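Because the fluorescent signal accumulates with each PCR cycle, real-time assays of this kind are commonly summarized by a threshold cycle (Ct) per reaction, and relative expression is often estimated with the comparative Ct (2^-ΔΔCt) calculation. The sketch below illustrates that widely used calculation; it is not described in this disclosure, it assumes roughly 100% amplification efficiency (a doubling per cycle), and the function name, the reference gene, and the Ct values are hypothetical.

```python
def relative_expression(ct_target_sample: float, ct_ref_sample: float,
                        ct_target_control: float, ct_ref_control: float) -> float:
    """Fold change of a target gene in a sample versus a control (2^-ddCt).

    Assumes each PCR cycle doubles the product, so a difference of one cycle
    corresponds to a two-fold difference in starting template.
    """
    # Normalize the target to a reference (housekeeping) gene in each specimen.
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    # Compare the normalized values between sample and control.
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Hypothetical Ct values: the target crosses threshold 2 cycles earlier
    # in the tumor sample than in the control after normalization.
    fold = relative_expression(22.0, 18.0, 24.0, 18.0)
    print(f"fold change: {fold}")
```

A lower Ct means more starting template, which is why the exponent is negated.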

In some embodiments, microarrays including, but not limited to: DNA microarrays (e.g., cDNA microarrays and oligonucleotide microarrays); protein microarrays; tissue microarrays; transfection or cell microarrays; chemical compound microarrays; and antibody microarrays are utilized for measuring cancer marker mRNA levels. A DNA microarray, commonly known as gene chip, DNA chip, or biochip, is a collection of microscopic DNA spots attached to a solid surface (e.g., glass, plastic or silicon chip) forming an array for the purpose of expression profiling or monitoring expression levels for thousands of genes simultaneously. The affixed DNA segments are known as probes, thousands of which can be used in a single DNA microarray. Microarrays can be used to identify disease genes by comparing gene expression in disease and normal cells. Microarrays can be fabricated using a variety of technologies, including but not limited to: printing with fine-pointed pins onto glass slides; photolithography using pre-made masks; photolithography using dynamic micromirror devices; ink jet printing; or electrochemistry on microelectrode arrays.

In yet other embodiments, reverse-transcriptase PCR (RT-PCR) is used todetect the expression of RNA. In RT-PCR, RNA is enzymatically convertedto complementary DNA or “cDNA” using a reverse transcriptase enzyme. ThecDNA is then used as a template for a PCR reaction. PCR products can bedetected by any suitable method, including but not limited to, gelelectrophoresis and staining with a DNA specific stain or hybridizationto a labeled probe. In some embodiments, the quantitative reversetranscriptase PCR with standardized mixtures of competitive templatesmethod described in U.S. Pat. Nos. 5,639,606, 5,643,765, and 5,876,978(each of which is herein incorporated by reference) is utilized.

In some embodiments, the cancer markers are detected by hybridization with a detectably labeled probe and measurement of the resulting hybrids. Illustrative non-limiting examples of detection methods are described below.

One illustrative detection method, the Hybridization Protection Assay (HPA), involves hybridizing a chemiluminescent oligonucleotide probe (e.g., an acridinium ester-labeled (AE) probe) to the target sequence, selectively hydrolyzing the chemiluminescent label present on unhybridized probe, and measuring the chemiluminescence produced from the remaining probe in a luminometer. See, e.g., U.S. Pat. No. 5,283,174; Nelson et al., Nonisotopic Probing, Blotting, and Sequencing, ch. 17 (Larry J. Kricka ed., 2d ed. 1995), each of which is herein incorporated by reference in its entirety.

The interaction between two molecules can also be detected, e.g., using fluorescence energy transfer (FRET) (see, for example, Lakowicz et al., U.S. Pat. No. 5,631,169; Stavrianopoulos et al., U.S. Pat. No. 4,968,103; each of which is herein incorporated by reference). A fluorophore label is selected such that a first donor molecule's emitted fluorescent energy will be absorbed by a fluorescent label on a second, ‘acceptor’ molecule, which in turn is able to fluoresce due to the absorbed energy.

Alternately, the ‘donor’ protein molecule may simply utilize the natural fluorescent energy of tryptophan residues. Labels are chosen that emit different wavelengths of light, such that the ‘acceptor’ molecule label may be differentiated from that of the ‘donor’. Since the efficiency of energy transfer between the labels is related to the distance separating the molecules, the spatial relationship between the molecules can be assessed. In a situation in which binding occurs between the molecules, the fluorescent emission of the ‘acceptor’ molecule label should be maximal. A FRET binding event can be conveniently measured through fluorometric detection means.
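By way of illustration only, the dependence of transfer efficiency on separation is commonly modeled by the Förster relation E = 1 / (1 + (r/R0)^6), where r is the donor-acceptor distance and R0 is the distance at which transfer is 50% efficient. Neither the relation nor the symbol R0 appears in the text above; both are assumptions of this sketch, as are the function name and the example distances.

```python
def fret_efficiency(distance_nm, forster_radius_nm):
    """Energy-transfer efficiency under the Forster relation.

    distance_nm       -- donor-acceptor separation r
    forster_radius_nm -- R0, the separation giving 50% efficiency (assumed known)
    """
    if distance_nm < 0 or forster_radius_nm <= 0:
        raise ValueError("distance must be >= 0 and Forster radius must be > 0")
    return 1.0 / (1.0 + (distance_nm / forster_radius_nm) ** 6)


# Hypothetical label pair with R0 = 5 nm.
for r in (2.5, 5.0, 10.0):
    print(r, round(fret_efficiency(r, 5.0), 4))
```

The steep sixth-power falloff is why acceptor emission is expected to be near maximal when the two labeled molecules are bound and close to zero when they are apart.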

Another example of a detection probe having self-complementarity is a “molecular beacon.” Molecular beacons include nucleic acid molecules having a target complementary sequence, an affinity pair (or nucleic acid arms) holding the probe in a closed conformation in the absence of a target sequence present in an amplification reaction, and a label pair that interacts when the probe is in a closed conformation. Hybridization of the target sequence and the target complementary sequence separates the members of the affinity pair, thereby shifting the probe to an open conformation. The shift to the open conformation is detectable due to reduced interaction of the label pair, which may be, for example, a fluorophore and a quencher (e.g., DABCYL and EDANS). Molecular beacons are disclosed, for example, in U.S. Pat. Nos. 5,925,517 and 6,150,097, each of which is herein incorporated by reference in its entirety.

By way of non-limiting example, probe binding pairs having interacting labels, such as those disclosed in U.S. Pat. No. 5,928,862 (herein incorporated by reference in its entirety), might be adapted for use in methods of embodiments of the present disclosure. Probe systems used to detect single nucleotide polymorphisms (SNPs) might also be utilized in the present invention. Additional detection systems include “molecular switches,” as disclosed in U.S. Publ. No. 20050042638, herein incorporated by reference in its entirety. Other probes, such as those comprising intercalating dyes and/or fluorochromes, are also useful for detection of amplification products in methods of embodiments of the present disclosure. See, e.g., U.S. Pat. No. 5,814,447 (herein incorporated by reference in its entirety).

In some embodiments, nucleic acid sequencing methods are utilized for detection. In some embodiments, the sequencing is Second Generation (a.k.a. Next Generation or Next-Gen), Third Generation (a.k.a. Next-Next-Gen), or Fourth Generation (a.k.a. N3-Gen) sequencing technology including, but not limited to, pyrosequencing, sequencing-by-ligation, single molecule sequencing, sequence-by-synthesis (SBS), semiconductor sequencing, massive parallel clonal, massive parallel single molecule SBS, massive parallel single molecule real-time, massive parallel single molecule real-time nanopore technology, etc. Morozova and Marra provide a review of some such technologies in Genomics, 92: 255 (2008), herein incorporated by reference in its entirety. Those of ordinary skill in the art will recognize that, because RNA is less stable in the cell and more prone to nuclease attack experimentally, RNA is usually reverse transcribed to DNA before sequencing.

DNA sequencing techniques include fluorescence-based sequencing methodologies (See, e.g., Birren et al., Genome Analysis: Analyzing DNA, 1, Cold Spring Harbor, N.Y.; herein incorporated by reference in its entirety). In some embodiments, the sequencing is automated sequencing. In some embodiments, the sequencing is parallel sequencing of partitioned amplicons (PCT Publication No: WO2006084132 to Kevin McKernan et al., herein incorporated by reference in its entirety). In some embodiments, the sequencing is DNA sequencing by parallel oligonucleotide extension (See, e.g., U.S. Pat. No. 5,750,341 to Macevicz et al., and U.S. Pat. No. 6,306,597 to Macevicz et al., both of which are herein incorporated by reference in their entireties). Additional examples of sequencing techniques include the Church polony technology (Mitra et al., 2003, Analytical Biochemistry 320, 55-65; Shendure et al., 2005 Science 309, 1728-1732; U.S. Pat. No. 6,432,360, U.S. Pat. No. 6,485,944, U.S. Pat. No. 6,511,803; herein incorporated by reference in their entireties), the 454 picotiter pyrosequencing technology (Margulies et al., 2005 Nature 437, 376-380; US 20050130173; herein incorporated by reference in their entireties), the Solexa single base addition technology (Bennett et al., 2005, Pharmacogenomics, 6, 373-382; U.S. Pat. No. 6,787,308; U.S. Pat. No. 6,833,246; herein incorporated by reference in their entireties), the Lynx massively parallel signature sequencing technology (Brenner et al. (2000). Nat. Biotechnol. 18:630-634; U.S. Pat. No. 5,695,934; U.S. Pat. No. 5,714,330; herein incorporated by reference in their entireties), and the Adessi PCR colony technology (Adessi et al. (2000). Nucleic Acid Res. 28, E87; WO 00018957; herein incorporated by reference in their entireties).

Next-generation sequencing (NGS) methods share the common feature of massively parallel, high-throughput strategies, with the goal of lower costs in comparison to older sequencing methods (see, e.g., Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; each herein incorporated by reference in its entirety). NGS methods can be broadly divided into those that typically use template amplification and those that do not. Amplification-requiring methods include pyrosequencing commercialized by Roche as the 454 technology platforms (e.g., GS 20 and GS FLX), Life Technologies/Ion Torrent, the Solexa platform commercialized by Illumina, GnuBio, and the Supported Oligonucleotide Ligation and Detection (SOLiD) platform commercialized by Applied Biosystems. Non-amplification approaches, also known as single-molecule sequencing, are exemplified by the HeliScope platform commercialized by Helicos BioSciences, and emerging platforms commercialized by VisiGen, Oxford Nanopore Technologies Ltd., and Pacific Biosciences, respectively.

In pyrosequencing (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. No. 6,210,891; U.S. Pat. No. 6,258,568; each herein incorporated by reference in its entirety), template DNA is fragmented, end-repaired, ligated to adaptors, and clonally amplified in-situ by capturing single template molecules with beads bearing oligonucleotides complementary to the adaptors. Each bead bearing a single template type is compartmentalized into a water-in-oil microvesicle, and the template is clonally amplified using a technique referred to as emulsion PCR. The emulsion is disrupted after amplification and beads are deposited into individual wells of a picotitre plate functioning as a flow cell during the sequencing reactions. Ordered, iterative introduction of each of the four dNTP reagents occurs in the flow cell in the presence of sequencing enzymes and a luminescent reporter such as luciferase. In the event that an appropriate dNTP is added to the 3′ end of the sequencing primer, the resulting production of ATP causes a burst of luminescence within the well, which is recorded using a CCD camera. It is possible to achieve read lengths greater than or equal to 400 bases, and 10⁶ sequence reads can be achieved, resulting in up to 500 million base pairs (Mb) of sequence.
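By way of illustration only, the ordered, iterative introduction of dNTPs can be sketched as a simple simulation in which each flow produces a signal proportional to the number of bases incorporated. The flow order "TACG", the idealized noise-free signal, and the function names below are assumptions of this sketch and are not taken from the platform described above.

```python
from itertools import cycle


def flowgram(read, flow_order="TACG", max_flows=400):
    """Simulate idealized flow signals for the strand being synthesized.

    Each flow offers one nucleotide; the signal is the number of consecutive
    bases of that type incorporated (0 if the next base does not match).
    """
    signals = []
    pos = 0
    for nucleotide in cycle(flow_order):
        if pos >= len(read) or len(signals) >= max_flows:
            break
        run = 0
        while pos < len(read) and read[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append((nucleotide, run))
    return signals


def call_bases(signals):
    """Rebuild the read from (nucleotide, signal) pairs."""
    return "".join(nucleotide * run for nucleotide, run in signals)


flows = flowgram("TTACGG")
print(flows)
print(call_bases(flows))
```

In this idealized model a homopolymer such as "TT" yields a single signal of twice the unit height, so base calling amounts to rounding each signal to the nearest whole number of incorporations.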

In the Solexa/Illumina platform (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. No. 6,833,246; U.S. Pat. No. 7,115,400; U.S. Pat. No. 6,969,488; each herein incorporated by reference in its entirety), sequencing data are produced in the form of shorter-length reads. In this method, single-stranded fragmented DNA is end-repaired to generate 5′-phosphorylated blunt ends, followed by Klenow-mediated addition of a single A base to the 3′ end of the fragments. A-addition facilitates addition of T-overhang adaptor oligonucleotides, which are subsequently used to capture the template-adaptor molecules on the surface of a flow cell that is studded with oligonucleotide anchors. The anchor is used as a PCR primer, but because of the length of the template and its proximity to other nearby anchor oligonucleotides, extension by PCR results in the “arching over” of the molecule to hybridize with an adjacent anchor oligonucleotide to form a bridge structure on the surface of the flow cell. These loops of DNA are denatured and cleaved. Forward strands are then sequenced with reversible dye terminators. The sequence of incorporated nucleotides is determined by detection of post-incorporation fluorescence, with each fluor and block removed prior to the next cycle of dNTP addition. Sequence read length ranges from 36 nucleotides to over 250 nucleotides, with overall output exceeding 1 billion nucleotide pairs per analytical run.

Sequencing nucleic acid molecules using SOLiD technology (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. No. 5,912,148; U.S. Pat. No. 6,130,073; each herein incorporated by reference in its entirety) also involves fragmentation of the template, ligation to oligonucleotide adaptors, attachment to beads, and clonal amplification by emulsion PCR. Following this, beads bearing template are immobilized on a derivatized surface of a glass flow-cell, and a primer complementary to the adaptor oligonucleotide is annealed. However, rather than utilizing this primer for 3′ extension, it is instead used to provide a 5′ phosphate group for ligation to interrogation probes containing two probe-specific bases followed by 6 degenerate bases and one of four fluorescent labels. In the SOLiD system, interrogation probes have 16 possible combinations of the two bases at the 3′ end of each probe, and one of four fluors at the 5′ end. Fluor color, and thus identity of each probe, corresponds to specified color-space coding schemes. Multiple rounds (usually 7) of probe annealing, ligation, and fluor detection are followed by denaturation, and then a second round of sequencing using a primer that is offset by one base relative to the initial primer. In this manner, the template sequence can be computationally re-constructed, and template bases are interrogated twice, resulting in increased accuracy. Sequence read length averages 35 nucleotides, and overall output exceeds 4 billion bases per sequencing run.
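By way of illustration only, the mapping of 16 dinucleotides onto four colors, and the computational reconstruction of the template, can be sketched as follows. The text above does not give the coding scheme; this sketch assumes the commonly described two-base encoding in which each base is assigned a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dinucleotide is the XOR of the two codes. The function names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit base codes
BASES = "ACGT"


def encode(sequence):
    """Convert a base sequence into colors, one color per adjacent base pair."""
    return [CODE[a] ^ CODE[b] for a, b in zip(sequence, sequence[1:])]


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and the color calls."""
    bases = [first_base]
    for color in colors:
        bases.append(BASES[CODE[bases[-1]] ^ color])
    return "".join(bases)


colors = encode("ATGGC")
print(colors)
print(decode("A", colors))
```

Because every base contributes to two adjacent colors, a true single-base variant changes two consecutive colors whereas an isolated color change is more likely a measurement error, which is one way the double interrogation described above can improve accuracy.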

In certain embodiments, sequencing is nanopore sequencing (see, e.g., Astier et al., J. Am. Chem. Soc. 2006 Feb. 8; 128(5):1705-10, herein incorporated by reference). The theory behind nanopore sequencing has to do with what occurs when a nanopore is immersed in a conducting fluid and a potential (voltage) is applied across it. Under these conditions a slight electric current due to conduction of ions through the nanopore can be observed, and the amount of current is exceedingly sensitive to the size of the nanopore. As each base of a nucleic acid passes through the nanopore, this causes a change in the magnitude of the current through the nanopore that is distinct for each of the four bases, thereby allowing the sequence of the DNA molecule to be determined.

In certain embodiments, sequencing is HeliScope by Helicos BioSciences (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. No. 7,169,560; U.S. Pat. No. 7,282,337; U.S. Pat. No. 7,482,120; U.S. Pat. No. 7,501,245; U.S. Pat. No. 6,818,395; U.S. Pat. No. 6,911,345; each herein incorporated by reference in its entirety). Template DNA is fragmented and polyadenylated at the 3′ end, with the final adenosine bearing a fluorescent label. Denatured polyadenylated template fragments are ligated to poly(dT) oligonucleotides on the surface of a flow cell. Initial physical locations of captured template molecules are recorded by a CCD camera, and then the label is cleaved and washed away. Sequencing is achieved by addition of polymerase and serial addition of fluorescently-labeled dNTP reagents. Incorporation events result in fluor signal corresponding to the dNTP, and signal is captured by a CCD camera before each round of dNTP addition. Sequence read length ranges from 25 to 50 nucleotides, with overall output exceeding 1 billion nucleotide pairs per analytical run.

The Ion Torrent technology is a method of DNA sequencing based on the detection of hydrogen ions that are released during the polymerization of DNA (see, e.g., Science 327(5970): 1190 (2010); U.S. Pat. Appl. Pub. Nos. 20090026082, 20090127589, 20100301398, 20100197507, 20100188073, and 20100137143, incorporated by reference in their entireties for all purposes). A microwell contains a template DNA strand to be sequenced. Beneath the layer of microwells is a hypersensitive ISFET ion sensor. All layers are contained within a CMOS semiconductor chip, similar to that used in the electronics industry. When a dNTP is incorporated into the growing complementary strand, a hydrogen ion is released, which triggers a hypersensitive ion sensor. If homopolymer repeats are present in the template sequence, multiple dNTP molecules will be incorporated in a single cycle. This leads to a corresponding number of released hydrogen ions and a proportionally higher electronic signal. This technology differs from other sequencing technologies in that no modified nucleotides or optics are used. The per-base accuracy of the Ion Torrent sequencer is ˜99.6% for 50 base reads, with ˜100 Mb to 100 Gb generated per run. The read-length is 100-300 base pairs. The accuracy for homopolymer repeats of 5 repeats in length is ˜98%. The benefits of ion semiconductor sequencing are rapid sequencing speed and low upfront and operating costs.

In some embodiments, sequencing is the technique developed by Stratos Genomics, Inc. and involves the use of Xpandomers. This sequencing process typically includes providing a daughter strand produced by a template-directed synthesis. The daughter strand generally includes a plurality of subunits coupled in a sequence corresponding to a contiguous nucleotide sequence of all or a portion of a target nucleic acid in which the individual subunits comprise a tether, at least one probe or nucleobase residue, and at least one selectively cleavable bond. The selectively cleavable bond(s) is/are cleaved to yield an Xpandomer of a length longer than the plurality of the subunits of the daughter strand. The Xpandomer typically includes the tethers and reporter elements for parsing genetic information in a sequence corresponding to the contiguous nucleotide sequence of all or a portion of the target nucleic acid. Reporter elements of the Xpandomer are then detected. Additional details relating to Xpandomer-based approaches are described in, for example, U.S. Pat. Pub No. 20090035777, entitled “High Throughput Nucleic Acid Sequencing by Expansion,” filed Jun. 19, 2008, which is incorporated herein in its entirety.

Other emerging single molecule sequencing methods include real-time sequencing by synthesis using a VisiGen platform (Voelkerding et al., Clinical Chem., 55: 641-58, 2009; U.S. Pat. No. 7,329,492; U.S. patent application Ser. No. 11/671,956; U.S. patent application Ser. No. 11/781,166; each herein incorporated by reference in its entirety) in which immobilized, primed DNA template is subjected to strand extension using a fluorescently-modified polymerase and fluorescent acceptor molecules, resulting in detectable fluorescence resonance energy transfer (FRET) upon nucleotide addition.

C. Detection of Protein

In other embodiments, gene expression of cancer markers is detected by measuring the expression of the corresponding protein or polypeptide. Protein expression may be detected by any suitable method. In some embodiments, proteins are detected by immunohistochemistry. In other embodiments, proteins are detected by their binding to an antibody raised against the protein. The generation of antibodies is described above.

Illustrative non-limiting examples of immunoassays include: immunoprecipitation; Western blot; ELISA; immunohistochemistry; immunocytochemistry; immunochromatography; flow cytometry; and immuno-PCR. Polyclonal or monoclonal antibodies detectably labeled using various techniques (e.g., colorimetric, fluorescent, chemiluminescent or radioactive labels) are suitable for use in the immunoassays.

Immunoprecipitation is the technique of precipitating an antigen out of solution using an antibody specific to that antigen. The process can be used to identify proteins or protein complexes present in cell extracts by targeting a specific protein or a protein believed to be in the complex. The complexes are brought out of solution by insoluble antibody-binding proteins isolated initially from bacteria, such as Protein A and Protein G. The antibodies can also be coupled to sepharose beads that can easily be isolated out of solution. After washing, the precipitate can be analyzed using mass spectrometry, Western blotting, or any number of other methods for identifying constituents in the complex.

A Western blot, or immunoblot, is a method to detect protein in a given sample of tissue homogenate or extract. It uses gel electrophoresis to separate denatured proteins by mass. The proteins are then transferred out of the gel and onto a membrane, typically polyvinylidene difluoride (PVDF) or nitrocellulose, where they are probed using antibodies specific to the protein of interest. As a result, researchers can examine the amount of protein in a given sample and compare levels between several groups.

An ELISA, short for Enzyme-Linked ImmunoSorbent Assay, is a biochemical technique to detect the presence of an antibody or an antigen in a sample. It utilizes a minimum of two antibodies, one of which is specific to the antigen and the other of which is coupled to an enzyme. The second antibody will cause a chromogenic or fluorogenic substrate to produce a signal. Variations of ELISA include sandwich ELISA, competitive ELISA, and ELISPOT. Because the ELISA can be performed to evaluate either the presence of antigen or the presence of antibody in a sample, it is a useful tool both for determining serum antibody concentrations and also for detecting the presence of antigen.

Immunohistochemistry and immunocytochemistry refer to the process of localizing proteins in a tissue section or cell, respectively, via the principle of antigens in tissue or cells binding to their respective antibodies. Visualization is enabled by tagging the antibody with color producing or fluorescent tags. Typical examples of color tags include, but are not limited to, horseradish peroxidase and alkaline phosphatase. Typical examples of fluorophore tags include, but are not limited to, fluorescein isothiocyanate (FITC) or phycoerythrin (PE).

Flow cytometry is a technique for counting, examining and optionally sorting microscopic particles or cells suspended in a stream of fluid. It allows simultaneous multiparametric analysis of the physical and/or chemical characteristics of single cells flowing through an optical/electronic detection apparatus. A beam of light (e.g., a laser) of a single frequency or color is directed onto a hydrodynamically focused stream of fluid. A number of detectors are aimed at the point where the stream passes through the light beam; one in line with the light beam (Forward Scatter or FSC) and several perpendicular to it (Side Scatter (SSC) and one or more fluorescent detectors). Each suspended particle passing through the beam scatters the light in some way, and fluorescent chemicals in the particle may be excited into emitting light at a lower frequency than the light source. The combination of scattered and fluorescent light is picked up by the detectors, and by analyzing fluctuations in brightness at each detector, one for each fluorescent emission peak, it is possible to deduce various facts about the physical and chemical structure of each individual particle. FSC correlates with the cell volume and SSC correlates with the density or inner complexity of the particle (e.g., shape of the nucleus, the amount and type of cytoplasmic granules or the membrane roughness).

Immuno-polymerase chain reaction (IPCR) utilizes nucleic acid amplification techniques to increase signal generation in antibody-based immunoassays. Because no protein equivalent of PCR exists, that is, proteins cannot be replicated in the same manner that nucleic acid is replicated during PCR, the only way to increase detection sensitivity is by signal amplification. The target proteins are bound to antibodies which are directly or indirectly conjugated to oligonucleotides. Unbound antibodies are washed away and the remaining bound antibodies have their oligonucleotides amplified. Protein detection occurs via detection of amplified oligonucleotides using standard nucleic acid detection methods, including real-time methods.

In other embodiments, the immunoassay described in U.S. Pat. Nos. 5,599,677 and 5,672,480 (each of which is herein incorporated by reference) is utilized.

D. Data Analysis

In some embodiments, a computer-based analysis program is used to translate the raw data generated by the detection assay (e.g., the presence, absence, or amount of a given marker or markers) into data of predictive value for a clinician. The clinician can access the predictive data using any suitable means. Thus, in some preferred embodiments, the present invention provides the further benefit that the clinician, who is not likely to be trained in genetics or molecular biology, need not understand the raw data. The data is presented directly to the clinician in its most useful form. The clinician is then able to immediately utilize the information in order to optimize the care of the subject.

The present invention contemplates any method capable of receiving, processing, and transmitting the information to and from laboratories conducting the assays, information providers, medical personnel, and subjects. For example, in some embodiments of the present invention, a sample (e.g., a biopsy or a serum or other sample) is obtained from a subject and submitted to a profiling service (e.g., clinical lab at a medical facility, genomic profiling business, etc.), located in any part of the world (e.g., in a country different than the country where the subject resides or where the information is ultimately used) to generate raw data. Where the sample comprises a tissue or other biological sample, the subject may visit a medical center to have the sample obtained and sent to the profiling center, or subjects may collect the sample themselves (e.g., a sputum sample) and directly send it to a profiling center. Where the sample comprises previously determined biological information, the information may be directly sent to the profiling service by the subject (e.g., an information card containing the information may be scanned by a computer and the data transmitted to a computer of the profiling center using an electronic communication system). Once received by the profiling service, the sample is processed and a profile is produced (i.e., expression data), specific for the diagnostic or prognostic information desired for the subject.

The profile data is then prepared in a format suitable for interpretation by a treating clinician. For example, rather than providing raw expression data, the prepared format may represent a diagnosis or risk assessment (e.g., likelihood of long term survival) for the subject, along with recommendations for particular treatment options. The data may be displayed to the clinician by any suitable method. For example, in some embodiments, the profiling service generates a report that can be printed for the clinician (e.g., at the point of care) or displayed to the clinician on a computer monitor.
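By way of illustration only, the translation of raw expression data into a clinician-facing risk assessment might be sketched as a weighted score compared against a cutoff. The disclosure does not specify a scoring model; the linear score, the marker names, the weights, and the cutoff below are all hypothetical placeholders chosen for this sketch.

```python
def risk_report(expression, weights, cutoff):
    """Summarize marker expression as a score and a risk category.

    expression -- mapping of marker name to measured level
    weights    -- mapping of marker name to coefficient (hypothetical)
    cutoff     -- score at or above which the subject is labeled high risk
    """
    missing = sorted(set(weights) - set(expression))
    if missing:
        raise ValueError(f"no measurement for markers: {missing}")
    score = sum(weights[m] * expression[m] for m in weights)
    category = "high risk" if score >= cutoff else "low risk"
    return {"score": score, "category": category}


# Hypothetical two-marker panel.
report = risk_report(
    expression={"MARKER_A": 2.0, "MARKER_B": 1.0},
    weights={"MARKER_A": 0.5, "MARKER_B": -1.0},
    cutoff=0.5,
)
print(report)
```

Returning a small summary rather than the raw values reflects the stated aim that the clinician need not interpret the underlying expression data.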

In some embodiments, the information is first analyzed at the point of care or at a regional facility. The raw data is then sent to a central processing facility for further analysis and/or to convert the raw data to information useful for a clinician or patient. The central processing facility provides the advantage of privacy (all data is stored in a central facility with uniform security protocols), speed, and uniformity of data analysis. The central processing facility can then control the fate of the data following treatment of the subject. For example, using an electronic communication system, the central facility can provide data to the clinician, the subject, or researchers.

In some embodiments, the subject is able to directly access the data using the electronic communication system. The subject may choose further intervention or counseling based on the results. In some embodiments, the data is used for research use. For example, the data may be used to further optimize the inclusion or elimination of markers as useful indicators of a particular condition or stage of disease.

E. Kits

In yet other embodiments, the present invention provides kits, compositions, and systems for the detection and characterization of lung cancer (e.g., by detecting the presence, absence, or level of expression of one or more of the cancer markers described herein). In some embodiments, the kits contain antibodies specific for one or more cancer markers, in addition to detection reagents and buffers. In other embodiments, the kits contain reagents specific for the detection of mRNA, cDNA or protein (e.g., oligonucleotide probes, primers, antibodies, optionally in an array format). In some embodiments, the kits comprise a plurality of probes, primer pairs, or sequencing primers for multiplex detection of one or more (e.g., all) of the cancer markers described herein. For example, in some embodiments, kits or systems comprise a plurality of distinct probes, primer pairs, or sequencing primers, each of which detects a different cancer marker described herein. In preferred embodiments, the kits contain all of the components necessary, sufficient or useful to perform a detection assay, including all controls, directions for performing assays, and any necessary software for analysis and presentation of results.

In some embodiments, kits and systems comprise computer systems (e.g., comprising computer processors, display screens, portable electronics, and the like) for collecting and analyzing data, as well as providing and displaying information to a user (e.g., characterization of lung cancer). Systems and methods for analyzing data are described above.

In some embodiments, the present disclosure provides samples comprising a plurality of reaction mixtures each comprising one or more nucleic acids encoding a cancer marker described herein or a cancer marker polypeptide bound to a detection reagent (e.g., probe, primer pair, sequencing primer, antibody, etc.). In some embodiments, each reaction mixture is a reaction mixture for detection of a distinct cancer marker nucleic acid or polypeptide.

IV. Drug Screening Applications

In some embodiments, the present disclosure provides drug screening assays (e.g., to screen for anticancer drugs). The screening methods of the present disclosure utilize cancer markers described herein alone or in combination with other markers. For example, in some embodiments, the present disclosure provides methods of screening for compounds that alter (e.g., increase or decrease) the expression of cancer markers. The compounds or agents may interfere with transcription, by interacting, for example, with the promoter region. The compounds or agents may interfere with mRNA. The compounds or agents may interfere with pathways that are upstream or downstream of the biological activity of the cancer marker. In some embodiments, candidate compounds are antisense or interfering RNA agents (e.g., oligonucleotides) directed against cancer markers or other pathway components. In other embodiments, candidate compounds are antibodies or small molecules that specifically bind to a cancer marker regulator or expression products of the present disclosure and inhibit its biological function.

In one screening method, candidate compounds are evaluated for their ability to alter cancer marker expression by contacting a compound with a cell or subject expressing a cancer marker and then assaying for the effect of the candidate compounds on expression. In some embodiments, the effect of candidate compounds on expression of a cancer marker gene is assayed for by detecting the level of cancer marker mRNA expressed by the cell. mRNA expression can be detected by any suitable method.

In other embodiments, the effect of candidate compounds on expression of cancer marker genes is assayed by measuring the level of polypeptide encoded by the cancer markers. The level of polypeptide expressed can be measured using any suitable method, including but not limited to, those disclosed herein.

Experimental

The following examples are provided in order to demonstrate and further illustrate certain preferred embodiments and aspects of the present disclosure and are not to be construed as limiting the scope thereof.

EXAMPLE 1 Methods

Cell culture: The A549 human lung adenocarcinoma cell line was obtained from the American Type Culture Collection (Manassas, Va.), maintained in RPMI-1640 medium with glutamine, supplemented with 10% FBS, penicillin, and streptomycin, and tested for mycoplasma contamination. All tissue culture media and media supplements were purchased from Life Technologies (Gaithersburg, Md.). The porcine transforming growth factor beta 1 (TGF-β) was purchased from PeproTech (Rocky Hill, N.J.). In all experiments, cells at 40-50% confluency were serum starved for 24 h and treated with TGF-β (5 ng/ml) for 72 h. At the end of treatment, the conditioned media was collected, centrifuged at 2000 g for 20 minutes, filtered through a 0.2 μm filter to remove intact cells and debris, and stored at −80 °C until further processing. Cells in the culture dishes were lysed in RIPA buffer and processed for western immunoblotting to assess the expression of epithelial and mesenchymal markers. Protein concentrations were determined using the BCA protein assay reagent from Pierce (Rockford, Ill., USA).

Sample preparation, SDS-PAGE and in-gel digestion: 7 ml of conditioned media from control and TGF-β treated cells from two independent biological replicates was buffer exchanged into 25 mM ammonium bicarbonate and the volume reduced to 100 μl using a 10 kDa MWCO filter (Millipore). Half of each replicate (approximately 20 μg protein from controls and 10 μg protein from TGF-β treatment) was solubilized in loading buffer and resolved using Novex 4-12% gradient gels (Invitrogen Life Technologies, Carlsbad, Calif.). Each lane was manually excised into 40 equal slices and each slice was transferred to a well of a 96-well plate. Proteins in each gel slice were robotically reduced with 10 mM dithiothreitol, alkylated with 50 mM iodoacetamide, and digested with 160 ng trypsin (ProGest, Genomic Solutions, Ann Arbor, Mich.). Tryptic peptides were analyzed following acidification with 0.5% formic acid to a final pH 3.8. The volume of peptide mixture for each band was 40 μl. (Bhattacharjee et al., Proc Natl Acad Sci USA 98:13790-5, 2001).

Data-dependent LC/MS/MS: 30 μl of each digested gel slice was analyzed using nanoLC/MS/MS on a LTQ Orbitrap XL tandem mass spectrometer (ThermoFisher, San Jose, Calif.). Sample was loaded onto an IntegraFrit (New Objective, Woburn, Mass.) 75 μm×3 cm vented column packed with 0.5 mm Jupiter C12 material (Phenomenex, Torrance, Calif.) at 10 μl/min. Peptides were eluted with a 50 min gradient (0.1-30% B in 35 min, 30-50% B in 10 min and 50-80% B in 5 min where A=99.9% H2O, 0.1% acetonitrile in 0.1% formic acid and B=80% acetonitrile, 20% H2O in 0.1% formic acid) at 300 nL/min using a NanoAcquity HPLC pump (Waters, Beverley, Mass.) over a 75 μm×15 cm IntegraFrit analytical column also packed with Jupiter C12 material. The column was coupled to a 30 μm ID×3 cm stainless steel emitter (ThermoFisher, USA). MS was performed in the Orbitrap at 60,000 FWHM resolution; MS/MS was performed in the LTQ on the top six ions in each MS scan using the data-dependent acquisition mode. Normalized collision energy was set at 35% and 1 microscan was used with Automatic Gain Control (AGC) implementation. AGC enables the trap to fill with ions to the set ion target values. Target values for MS and MS/MS were 5×10⁴ and 1.5×10³ counts, respectively. Dynamic exclusion and repeat settings ensured each ion was selected only once and excluded for 30 s thereafter.

Data Processing: Data were processed using the MaxQuant v1.0.13.8 software, which provides protein identifications at a target false discovery rate (FDR). This version of MaxQuant utilizes a locally stored copy of the Mascot search engine (version 2.2, Matrix Science, London, UK), and data were searched against the IPI Human v3.53 protein database. Search parameters were: product ion mass tolerance 0.5 Da, 2 missed cleavages allowed, fully tryptic peptides only, fixed modification of carbamidomethyl cysteine, variable modifications of oxidized methionine, N-terminal acetylation and pyro-glutamic acid on N-terminal glutamine. Selected MaxQuant parameters were: “singlets” mode, peptide, protein and site FDR 1%, min. peptide length of 5 amino acids, minimum of one unique peptide per protein. Proteins identified by this analysis are summarized in Table S1 (Lacroix et al., Expert Rev Mol Diagn 8:167-78, 2008). In MaxQuant the quantitative measure of each protein is based on the sum of the chromatographic peak area of each peptide matched, termed “intensity”. For each protein a Log2 ratio of expression is determined by comparing the average intensity for that protein between the replicates of TGF-β treated and controls. A protein is determined as differentially expressed if it has more than two-fold change in either direction. Log2 ratio>1 is considered as upregulation and Log2 ratio<−1 is considered as downregulation.
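
As an illustration of the fold-change rule described above, the following minimal Python sketch computes a Log2 ratio from replicate intensities and classifies a protein. The function names and the example intensities are hypothetical and are not part of the MaxQuant software; the symmetric cut-off of −1 for decreased secretion is assumed from the criterion reported in the Results.

```python
import math

def log2_ratio(treated, control):
    """Log2 of (mean treated intensity / mean control intensity)."""
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(control) / len(control)
    return math.log2(mean_treated / mean_control)

def classify(ratio, cutoff=1.0):
    """Label a protein by its Log2 ratio; cutoff=1.0 corresponds to two-fold."""
    if ratio > cutoff:
        return "up"
    if ratio < -cutoff:
        return "down"
    return "unchanged"

# Hypothetical intensities from two replicates per condition.
ratio = log2_ratio(treated=[8.0e6, 1.0e7], control=[2.0e6, 2.5e6])
print(classify(ratio))  # mean ratio is 4, Log2 ratio is 2, so this prints "up"
```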

Annotation of secreted proteins and mapping to gene expression: Proteins were annotated as secreted using multiple different bioinformatic tools including SecretomeP (Bendtsen et al., Protein Eng Des Sel 17:349-56, 2004) for non-classical and leaderless secreted proteins, TMHMM, an HMM-based method for prediction of transmembrane domains (Moller et al., Bioinformatics 17:646-53, 2001), the SignalP package that detects signal peptides and predicts classical secreted proteins (Bendtsen et al., J Mol Biol 340:783-95, 2004), PSORT II that predicts the protein sub-cellular localization (Nakai et al., Trends Biochem Sci 24:34-6, 1999), and the Secreted Protein Database (SPD) (Chen et al., Nucleic Acids Res 33:D169-73, 2005). Others were annotated as secreted proteins based on reported empirical evidence and GO analysis.

Entrez gene identifiers corresponding to the IPI accession numbers of identified proteins were obtained using human IPI cross reference data (“IPI.genes.HUMAN” for IPI human release 3.65). Entrez gene identifiers were used to obtain the corresponding probe set identifiers for the associated arrays from the Affymetrix annotation. Following the above protocol, all the annotated secreted proteins were mapped to a previously published TGF-β-induced EMT time course gene expression data set (GSE17708) from the same cell line under identical conditions (Sartor et al., Bioinformatics 26:456-63). To match the secretome, only genes differentially expressed at the 72 h time point (5057 probes corresponding to 3397 genes) were used for mapping. Some probes that were identified as differentially expressed but had no assigned gene symbol were excluded (Garber et al., Proc Natl Acad Sci USA 98:13784-9, 2001).

Gene set enrichment and hierarchical clustering analysis: ConceptGen is a concept and gene set enrichment analysis tool (Sartor et al., Bioinformatics 26:456-63). It tests a given list of genes for overlap, and the significance of that overlap, with a specified concept or gene set, which includes Gene Ontology (GO), direct protein interactions, transcriptional regulation, miRNA targets and gene expression datasets. Using this tool, we performed GO cellular component, cellular process and KEGG pathway enrichment analysis for the 97-gene EASP. Statistically significant (p<0.001) concepts are presented as network graphs with nodes representing concepts or gene sets and edges representing statistical significance of enrichment.

For clustering, the lists of oncogenic pathways included in the analysis were compiled from the KEGG database, except for the ESC list, which was based on the studies of Ben-Porath et al. and Hassan et al. (Ben-Porath et al., Nat Genet 40:499-507, 2008; Hassan et al., Clin Cancer Res 15:6386-90, 2009). The expression value for each pathway, including EASP, is the arithmetic mean of all genes in that pathway, giving a single value for each pathway in a given sample. Hierarchical clustering of the Shedden et al. 442 lung adenocarcinoma tumors (Shedden et al., Nat Med 14:822-7, 2008) was performed for the indicated oncogenic pathways along with EASP using TreeView, and correlations are presented as a heat map with columns representing individual tumors and rows representing the arithmetic mean of a pathway.
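
The per-pathway expression value described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TreeView workflow itself: the data layout (nested dictionaries), the function name and the gene and tumor labels are hypothetical, and genes missing from a sample are simply skipped.

```python
def pathway_scores(expression, pathways):
    """Return {pathway: {sample: mean expression of the pathway genes}}.

    expression: {sample: {gene: expression value}}
    pathways:   {pathway name: [gene, ...]}
    """
    scores = {}
    for name, genes in pathways.items():
        scores[name] = {}
        for sample, values in expression.items():
            present = [values[g] for g in genes if g in values]
            # Arithmetic mean over the genes measured in this sample.
            scores[name][sample] = (
                sum(present) / len(present) if present else float("nan")
            )
    return scores

# Hypothetical two-tumor, three-gene example.
expression = {
    "tumor_1": {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 9.0},
    "tumor_2": {"GENE_A": 1.0, "GENE_B": 3.0, "GENE_C": 5.0},
}
pathways = {"PATHWAY_X": ["GENE_A", "GENE_B"], "PATHWAY_Y": ["GENE_C"]}
scores = pathway_scores(expression, pathways)
print(scores["PATHWAY_X"]["tumor_1"])  # mean of 2.0 and 4.0, prints 3.0
```

Each outer key corresponds to one row of the heat map (a pathway) and each inner key to one column (a tumor).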

Primary tumor-derived gene expression data sets and patient characteristics: Four published Affymetrix microarray data sets representing 908 lung tumors were used in the EASP survival analysis. The CEL files of microarray data were normalized using the Robust Multi-array Average (RMA) method (Irizarry et al., Biostatistics 4:249-64, 2003). The Shedden et al. 442 lung adenocarcinomas (Shedden) were used as the training set (Shedden et al., Nat Med 14:822-7, 2008). The other three data sets were used as test sets: the Bild et al. data set of 111 adenocarcinomas and squamous cell carcinomas (Bild) (Bild et al., Nature 439:353-7, 2006), the Okayama et al. data set of 226 early stage (stages 1 and 2) adenocarcinomas (Okayama) (Okayama et al., Cancer Res 72:100-11, 2012), and the Raponi et al. data set of 129 squamous cell carcinomas (Raponi) (Raponi et al., Cancer Res 66:7466-72, 2006). The patient characteristics and clinical information for these four data sets are provided in Table 2. The primary end point was 5-year survival (Buyse et al., J Natl Cancer Inst 98:1183-92, 2006).

Statistical analysis method: The random survival forests method, implemented as an R package by Ishwaran et al. (Ishwaran H, Kogalur UB: Random survival forests for R. R News 7:7, 2007; Ishwaran et al., Ann. Appl. Statist 2:20, 2008), was employed for the EASP survival analysis of the four microarray data sets of lung cancer, as described before (Chen et al., J Thorac Oncol 6:1481-7, 2011). Briefly, the Random Survival Forest (RSF) is an ensemble tree method for analysis of right-censored survival data. Each decision tree of the forest was grown by splitting patients at each node according to survival differences, assessed by the log-rank test, on a randomly selected subset of variables. A total of 1000 trees were grown for each RSF. Once the trees were built, test sets were dropped down the trees for prediction. The cumulative hazard function (CHF) was derived from each tree, and an ensemble CHF, an average over the 1000 survival trees, was determined. Mortality was obtained as a weighted sum over the ensemble CHF, weighted by the number of individuals at risk at the different time points. Higher mortality values imply higher risk. Mortality was used as a risk index to separate patients into three risk groups (high, medium and low risk, one third in each group), and Kaplan-Meier survival curves are presented for each group. Each tree provides a measure of its predictive error as described by Ishwaran et al. (supra), with a smaller number indicating a better tree. The prediction error is calculated as 1−C-index (i.e., Harrell's concordance index) in the out-of-bag data, which were not used for building a given tree.
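
Two of the steps above, the prediction error (1−C-index) and the split of patients into three equal risk groups by mortality, can be sketched in plain Python. This is an illustrative sketch only and does not reproduce the R package used in the analysis: the function names are hypothetical, pairs with tied survival times are ignored for simplicity, and the example data are invented.

```python
def harrell_c_index(times, events, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is usable when patient i has an observed event
    strictly before patient j's follow-up time. The pair is concordant
    when the patient with the shorter survival has the higher risk;
    tied risks count as half. Tied times are skipped (simplification).
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable

def prediction_error(times, events, risk):
    """Prediction error expressed as 1 - C-index."""
    return 1.0 - harrell_c_index(times, events, risk)

def risk_tertiles(mortality):
    """Assign 'low', 'medium' or 'high' so each group holds about one third."""
    order = sorted(range(len(mortality)), key=lambda k: mortality[k])
    n = len(order)
    groups = [None] * n
    for rank, idx in enumerate(order):
        if 3 * rank < n:
            groups[idx] = "low"
        elif 3 * rank < 2 * n:
            groups[idx] = "medium"
        else:
            groups[idx] = "high"
    return groups

# Hypothetical data: follow-up time, event flag (1 = death), mortality index.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
mortality = [4.0, 3.0, 2.0, 1.0]
print(prediction_error(times, events, mortality))  # perfectly ordered, prints 0.0
```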

Variable importance scores (VIMPs) for all the variables used to grow trees were also generated. Large VIMPs indicate variables that are good predictors of outcome, whereas zero or negative values identify non-predictive variables. These scores were used to refine the 97-gene EASP to the 20-gene rEASP.
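
The refinement step can be sketched as a simple ranking by VIMP. The sketch below assumes that genes with zero or negative VIMP are dropped before the top-ranked genes are taken, which is one reading of the description above; the function name and the scores are hypothetical.

```python
def refine_by_vimp(vimp, k=20):
    """Return up to k genes with the highest positive VIMP scores.

    vimp: {gene: variable importance score}
    Genes with zero or negative scores are treated as non-predictive
    and excluded (assumption for this sketch).
    """
    predictive = {gene: score for gene, score in vimp.items() if score > 0}
    ranked = sorted(predictive, key=lambda gene: predictive[gene], reverse=True)
    return ranked[:k]

# Hypothetical scores for four genes, keeping the top two.
scores = {"GENE_A": 0.012, "GENE_B": -0.003, "GENE_C": 0.030, "GENE_D": 0.0}
print(refine_by_vimp(scores, k=2))  # prints ['GENE_C', 'GENE_A']
```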

Cox proportional hazards regression model, Kaplan-Meier survival curve and log-rank test were used for survival analysis of individual genes or the mortality index derived from RSF. The t test was used to assess the difference in mean expression of the EASP signature between clinical and pathological groups including stage, differentiation and nodal status.

Results

Quantitative identification of differentially secreted proteins during EMT: For the analysis of secreted proteins, A549 lung adenocarcinoma cells were cultured in serum-free media and stimulated with TGF-β for 72 h to induce EMT, and the conditioned media were collected from control and TGF-β treated cells for the analysis of differentially secreted proteins. Induction of EMT was confirmed by assessing E-cadherin, N-cadherin and vimentin expression in the cells by western immunoblotting as described before (Keshamouni et al., J Proteome Res 8:35-47, 2009; Keshamouni et al., J Proteome Res 5:1143-54, 2006). Proteins in the conditioned media from two different biological replicates were fractionated by SDS-PAGE. Each lane on the gel was cut into 40 slices. Proteins in each gel slice were subjected to trypsin digestion and analysed by LC-MS/MS on a LTQ-Orbitrap mass spectrometer. The resulting MS/MS spectra were analysed for protein identification and quantitation using MaxQuant as described under Methods. A total of 2410 proteins were identified, of which 1647 (70%) proteins were annotated as secreted using the multiple databases and strategies described in Methods. With the criterion of at least two-fold change, 136 proteins were identified as increased in secretion (log2 ratio>1) and 94 proteins as decreased in secretion (log2 ratio<−1) during EMT.

Among the differentially secreted proteins, various categories of proteins were observed, including increased secretion of proteases (MMP2, MMP9, BMP1), ECM components (collagens, fibronectin, versican and SPARC), cytokines (CTGF) and cell surface receptors (mucins, CD59) that are consistent with the migratory, invasive and immune evasive abilities conferred by EMT and their regulation by TGF-β.

EMT associated secretory phenotype (EASP): To identify a gene signature that is representative of EMT and serves as a reliable biomarker for patient prognosis, the differentially secreted protein profile was interrogated with the corresponding gene expression profile (Sartor et al., Bioinformatics 26:456-63) from the same cell line under identical conditions. To match the secretome, only genes differentially expressed at the 72 h time point of the time course data set were used for integration. Since the goal is to derive a measurable signature, only proteins whose secretion is induced during EMT were considered. By integrating gene and protein expression, 97 genes were identified that are upregulated at the mRNA level by at least two-fold (p<0.01) and increased in secretion at the protein level by at least two-fold irrespective of p-value; these genes were defined as EASP (Table 1). Given the stringent p-value cut-off used for mRNA expression and the minimum two-fold change used for concordance between mRNA and protein, the p-value was not used for protein expression in defining EASP.
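
The integration rule can be illustrated with a short Python sketch that keeps genes passing both the mRNA and the secretome criteria. It is a simplified illustration under stated assumptions: the function name, data layout and example values are hypothetical, and the p-value filter is assumed to retain genes with p below 0.01, in keeping with the stringent cut-off described above.

```python
def concordant_genes(mrna, secretome, min_fold=2.0, max_p=0.01):
    """Return genes induced at both the mRNA and secreted-protein level.

    mrna:      {gene: (fold change, p-value)} from the expression array
    secretome: {gene: fold change} from the conditioned-media proteomics
    No p-value is applied to the protein data, matching the text.
    """
    selected = []
    for gene, (fold, p_value) in mrna.items():
        protein_fold = secretome.get(gene)
        if protein_fold is None:
            continue  # protein not detected in the secretome
        if fold >= min_fold and p_value < max_p and protein_fold >= min_fold:
            selected.append(gene)
    return sorted(selected)

# Hypothetical values for four genes.
mrna = {
    "GENE_A": (4.0, 0.001),   # passes both mRNA criteria
    "GENE_B": (3.0, 0.20),    # fails the p-value filter
    "GENE_C": (1.2, 0.001),   # fails the fold-change filter
    "GENE_D": (5.0, 0.002),   # passes mRNA, but protein fold is too low
}
secretome = {"GENE_A": 2.5, "GENE_B": 6.0, "GENE_C": 3.0, "GENE_D": 1.1}
print(concordant_genes(mrna, secretome))  # prints ['GENE_A']
```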

For functional interpretation, EASP was subjected to gene set enrichment analysis using ConceptGen (Sartor et al., supra). Analysis for cellular components associated EASP with extracellular matrix, proteinaceous extracellular, collagen, basement membrane, matrix space, matrix part and matrix region part (FIG. 1A), consistent with their annotation as secretory proteins. More importantly, enrichment analysis for biological processes associated EASP with cellular processes including cell adhesion, motility, actin cytoskeleton reorganization, coagulation, acute inflammatory response, proteolysis and response to wounding and external stimuli (FIG. 1A). Moreover, this also demonstrates that EASP is a true representation of EMT and serves as a reliable biomarker to track EMT.

To assess the correlation of EASP with other known oncogenic pathways, hierarchical clustering of 442 lung adenocarcinomas based on their mean gene expression of the indicated pathways was performed. Clustering analysis yielded two distinct lung adenocarcinoma tumor groups, with 50% of tumors demonstrating higher expression of all pathways. The mean EASP expression pattern correlated with the mean gene expression of all the oncogenic pathways tested. These include the NF-κB, anti-apoptosis, JAK-STAT, Notch, AKT and WNT pathways and the embryonic stem cell (ESC) signature (FIG. 1B). All these pathways are known to be deregulated in lung adenocarcinomas and have been implicated in the regulation of EMT.

Correlation of EASP with clinical variables: The ability of the EASP signature to stratify patients based on tumor stage, differentiation and nodal status was determined using the gene expression data derived from the Shedden et al. 442 lung adenocarcinoma patients (Shedden et al., Nat Med 14:822-7, 2008) (FIG. 2A). The EASP signature was able to distinguish patients with well differentiated tumors from those with moderately and poorly differentiated tumors (P<0.001). Similarly, the EASP signature was able to separate patients with stage I tumors from those with stage II and stage III tumors (P<0.01). Furthermore, EASP signature expression is high in patients with positive nodal status (N1-2) compared to patients with negative nodal status (N0). Together these results indicate the clinical utility of EASP in predicting aggressive tumor behavior.

EASP stratifies lung cancer patients into low, medium and high-risk groups with distinct survival: In order to investigate whether the 97-gene EASP signature could predict overall survival in NSCLC patients, the Shedden data set (n=442) was used as the training set. As detailed in Methods, a mathematical model based on the RSF algorithm was built in the training set to predict the prognostic significance of EASP with stage, age and sex included. After the model was locked down, it was tested in three independent publicly available lung cancer data sets: Bild et al. (n=111) (Nature 439:353-7, 2006), Okayama et al. (n=226) (Cancer Res 72:100-11, 2012) and Raponi et al. (n=129) (Cancer Res 66:7466-72, 2006). These cohorts include lung adenocarcinoma and squamous cell carcinoma patients. The prediction error rates were 33.6%, 30.0% and 36.7% for the Bild, Okayama and Raponi data sets, respectively (Table 4). The usefulness of the RSF predictors was tested using a univariate Cox model with the mortality index as a continuous measure. The RSF prediction was significant for the Bild test set (likelihood ratio test (LRT), P=0.00008), the Okayama test set (LRT, P=0.005) and the Raponi test set (LRT, P=0.02). In all three test sets, low, medium and high-risk groups were clearly separated by mortality index (FIG. 2B). The hazard ratios for the low, medium and high-risk groups were, respectively, 1.00, 2.17 and 3.16 for the Bild data set (log-rank test, P=0.003); 1.00, 2.60 and 5.37 for the Okayama data set (log-rank test, P=0.002); and 1.00, 0.96 and 2.61 for the Raponi data set (log-rank test, P=0.002) (Table 4).

Refining the 97-gene EASP to the 20-gene rEASP: In order to refine the EASP into a smaller gene subset that remains as effective in predicting survival of lung cancer patients, the 97-gene EASP was filtered based on the variable importance (VIMP) scores generated by RSF analysis of the Shedden et al. 442 data set. Higher VIMP values indicate variables with predictive ability, whereas zero or negative values identify non-predictive variables. This resulted in a refined EASP (rEASP) comprising the 20 genes with the highest VIMP scores (FIG. 4). To assess whether the 20-gene rEASP performs as well as the 97-gene EASP signature in predicting patient survival, another model based on the RSF algorithm was built to predict the prognostic significance of rEASP with stage, age and sex included, using the Shedden data set (n=442) as the training set. Next, the predictive power of this 20-gene rEASP model was tested in all three independent test sets described above (Bild, Raponi and Okayama), whose clinical information is given in Table 2. The prediction error rates were 35.6%, 29.9% and 36.4% for the Bild, Okayama and Raponi data sets, respectively (Table 3). The usefulness of the RSF predictors was tested using a univariate Cox model with the mortality index as a continuous measure. The RSF prediction was significant for the Bild test set (likelihood ratio test (LRT), P=0.002), the Okayama test set (LRT, P=0.0004) and the Raponi test set (LRT, P=0.007). In all three test sets, low, medium and high-risk groups were clearly separated by mortality index (FIG. 3). The hazard ratios for the low, medium and high-risk groups were, respectively, 1.00, 1.54 and 2.34 for the Bild data set (log-rank test, P=0.03); 1.00, 2.37 and 3.75 for the Okayama data set (log-rank test, P=0.002); and 1.00, 2.38 and 2.79 for the Raponi data set (log-rank test, P=0.02) (Table 3).

Contrary to the perception that tumor metastasis progresses in a linear and step-wise fashion, recent evidence suggests that a subset of tumors harbor molecular alterations at an early stage that are indicative of bad prognosis and poor patient survival (Ramaswamy et al., Nat Genet 33:49-54, 2003). This demonstrates the importance of identifying molecular changes at an early stage that dictate clinical behavior. The current system of TNM staging cannot identify such changes. There is an urgent need to develop prognostic tests that can predict recurrence and identify high-risk patients at an early stage when they would benefit from adjuvant therapy (Felip et al., Ann Oncol 16 Suppl 1:i28-9, 2005; Scagliotti et al., J Natl Cancer Inst 95:1453-61, 2003; Johnson et al., Clin Cancer Res 11:5022s-5026s, 2005; Keller et al., N Engl J Med 343:1217-22, 2000; Domont et al., Semin Oncol 32:279-83, 2005). Demonstrating the utility of such a prognostic test, a 70-gene signature (Buyse et al., J Natl Cancer Inst 98:1183-92, 2006) (MammaPrint, Agendia, The Netherlands) has been approved by the FDA for breast cancer patients (Evolution of Translational Omics: Lessons Learned and the Path Forward, The National Academies Press, 2012). A 21-gene signature (Paik, Oncologist 12:631-5, 2007) (Oncotype DX, Genomic Health, CA) is approved for breast cancers, with analogous signatures under development for prostate and colon cancers. Even though multiple gene, protein, auto-antibody and miRNA-based profiles have been proposed for lung cancer prognosis, none to date has been approved for clinical use.

The predominant approach in deriving most prognostic signatures has been to profile molecular changes that differ between good and bad outcome groups, without any consideration of the underlying tumor biology. Here, a new approach is described that identifies predictive biomarkers by profiling the complex cellular process of EMT, which is implicated in the initiation of tumor metastasis. The rationale behind this approach is that identifying proteins secreted during the course of a critical biological process that promotes a metastatic phenotype would provide relevant, reliable and robust prognostic biomarkers. In addition, one can measure EASP at the mRNA level and also at the protein level, because the biomarkers of EASP are based on the strong concordant expression of both mRNA and protein. Given that EMT is the initiating event for metastasis and may result in the dissemination of tumor cells, measuring EASP in the primary tumor, in tumor cells in the bone marrow compartment and in circulating tumor cells makes it possible to track disease progression.

Consistent with the functional attributes conferred by EMT and its regulation by TGF-β, proteins were identified that are implicated in tumor cell adhesion, migration, invasion, immune evasive mechanisms, extracellular matrix components, and tumor-stromal interactions. Gene set enrichment analysis of the 97-gene EASP, which is the subset of all upregulated proteins and their mRNAs, also identified biological processes that are reflective of EMT and TGF-β biology. Similarly, the proteins in rEASP are also representative of the functional EMT phenotype. Furthermore, clustering analysis of EASP with key oncogenic pathways showed a similar correlation to the expression of various pathways that are deregulated in lung adenocarcinomas. These include the NF-κB, anti-apoptosis, JAK-STAT, PTEN, AKT, WNT, Notch, Hedgehog and EGFR signaling pathways (Faivre et al., Semin Oncol 33:407-20, 2006; Sun et al., J Clin Invest 117:2740-50, 2007). Most importantly, the correlation of EASP expression with the embryonic stem cell (ESC) signature is consistent with the recent finding that the ESC signature is associated with poor prognosis and worse overall survival in lung adenocarcinoma patients (Hassan et al., Clin Cancer Res 15:6386-90, 2009). This correlation is also consistent with the finding that EMT may confer stem cell-like properties on breast cancer cells (Mani et al., Cell 133:704-15, 2008). Together, these observations demonstrate that EASP reflects the heterogeneity and complexity associated with oncogenesis of lung cancer, and they underscore the significance and relevance of EASP biomarkers to the underlying biology of tumor metastasis.

Consistent with its prognostic significance, EASP distinguished well-differentiated from moderately or poorly differentiated tumors and stage 1 from stage 2 and 3 patients. Most importantly, it was strongly correlated with positive lymph node status, which is an important prognostic factor that influences therapeutic decision making and the probability of lung cancer recurrence. To test the clinical utility of EASP, the RSF-based survival model was built and trained on the 442 primary lung adenocarcinoma tumor-derived gene expression data set, the largest lung cancer gene expression data set available with pathological, clinical and treatment annotations (Shedden et al., Nat Med 14:822-7, 2008). Using the variable importance (VIMP) scores from the training set (Shedden et al., supra), the EASP was refined into a subset of 20 genes (rEASP) with the highest VIMP scores, and its prognostic significance was tested in three independent lung cancer data sets. Since the EASP was derived from an adenocarcinoma cell line, it was investigated whether EASP is specific to adenocarcinomas or also has relevance to other subtypes of lung cancer. To address this, the Okayama et al. data set of only adenocarcinomas (n=226), the Bild et al. data set of adenocarcinomas (n=58) and squamous cell carcinomas (n=53), and the Raponi et al. data set of only squamous cell carcinomas (n=129) were selected for independent testing. In all three independent test sets, both EASP and rEASP based models were able to stratify patients into low, medium and high-risk groups with clearly separated mortality indexes and distinct hazard ratios. This demonstrates the relevance of these models for lung SCC. Since rEASP predicted the survival of early stage patients, it provides a prognostic signature to identify the high-risk early stage patients who might benefit the most from adjuvant therapy. Multiple inhibitors of EMT in lung cancer have been identified (Reka et al., Mol Cancer Ther 9:3221-32, 2010; Reka et al., J Thorac Oncol 6:1784-92, 2011). Such inhibitors find use in patients with high EASP expression.

TABLE 1 List of Genes that Constitute EASP with Corresponding FoldChange for Gene and Protein Expression. 20 Genes that are part of rEASPare in bold and underlined. Fold change (TGF β/control) Gene SymbolEntrez ID Protein Acc Id Gene Title Microarray Secretome ADAM19 8728IPI00011901 a disintegrin and metalloproteinase domain 19 (meltrin beta)17.47 26.78 ANGPTL4 51129 IPI00153060 anglopoietin-like 4 19.67 2.33AP1S2 8905 IPI00909244 Adaptor-related protein complex 1, sigma 2subunit 2.15 3.23 ARPC4 10093 IPI00925052 actin related protein 2/3complex, subunit 4, 20 kDa 2.03 1.41 BMP1 649 IPI00014021 bonemorphogenetic protein 1 4.68 17.90 BPGM 669 IPI002159792,3-bisphosphoglycerate mutase 3.85 4.23 CD151 977 IPI00298851 CD151antigen 1.81 2.09 CD59 966 IPI00011302 CD59 antigen 3.99 2.53 CHST1150515 IPI00099831 Carbohydrate (chondroitin 4) sulfotransferase 11 4.624.00 CHST3 9469 IPI00306853 carbohydrate (chondroitin 6)sulfotransferase 3 4.30 3.61 COL1A1 1277 IPI00297646 collagen, type I,alpha 1 10.11 3.44 COL4A1 1282 IPI00743696 collagen, type IV, alpha 162.47 5.41 COL4A2 1284 IPI00306322 collagen, type IV, alpha 2 15.99 5.24COL4A3 1285 IPI00010360 collagen, type IV, alpha 3 (Goodpasture antigen)3.18 4.57 COL7A1 1294 IPI00025418 collagen, type VII, alpha 1 3.42 1.35CRIP2 1397 IPI00921911 cysteine-rich protein 2 2.86 19.22 VCAN 1462IPI00215628 chondroitin sulfate proteoglycan 2 (versican) 2.97 1.89 CTGF1490 IPI00020977 connective tissue growth factor 4.96 4.37 CXCL12 6387IPI00719836 chemokine (C-X-C motif) ligand 12 (stromal cell-derivedfactor 1) 5.56 17.71 CYR61 3491 IPI00299219 cysteine-rich, angiogenicinducer, 61 5.66 2.16 DSC2 1824 IPI00025846 desmocollin 2 4.12 3.15 ECM11893 IPI00645849 extracellular matrix protein 1 2.09 3.64 EFNA1 1942IPI00025840 ephrin-A1 1.72 1.14 EIF4EBP1 1978 IPI00002569 eukaryotictranslation initiation factor 4E binding protein 1 1.94 20.54 EPHB2 2048IPI00021275 EPR receptor 62 8.07 1.76 FHL2 2274 IPI00396967 four and ahalf LIM domains 2 3.15 
3.31 FN1 2335 IPI00845263 Fibronectin 1 5.143.15 FST 10468 IPI00021081 follistatin 6.45 1.30 FSTL1 11167 IPI00029723follistatin-like 1 5.14 2.18 FSTL3 10272 IPI00025155 follistatin-like 3(secreted glycoprotein) 3.98 2.21 C11orf41 25758 IPI00852979 G2 protein3.98 20.18 GALNT2 2590 IPI00004669UDP-N-acetyl-alpha-D-galactosamine:polypeptide 2.78 1.69N-acetylgalactosaminyltransferase 2 (GalNAc-T2) GSN 2934 IPI00646773gelsolin (amyloidosis, Finnish type) 3.99 2.70 HMGA2 8091 IPI00005996high mobility group AT-hook 2///high mobility group AT-hook 2 4.20 3.13HMOX1 3162 IPI00215893 heme oxygenase (decycling) 1 3.41 17.36 HSPB13315 IPI0025512 heat shock 27 kDa protein 1 2.28 2.42 IGF1 3479IPI00433029 insulin-like growth factor 1 (somatomedin C) 4.65 1.11 IGF23481 IPI00215977 insulin-like growth factor 2 (somatomedin A) 1.03 1.87IGFBP5 3488 IPI00029236 insulin-like growth factor binding protein 521.60 7.17 IGFBP7 3490 IPI00016915 insulin-like growth factor bindingprotein 7 63.94 5.33 IL11 3589 IPI00025820 interleukin 11 39.43 5.57INHBA 3624 IPI00028670 Inhibin, beta A (activin A, activin AB alphapolypeptide) 24.32 25.47 ITGA2 3673 IPI00013744 Integrin, alpha 2(CD49B, alpha 2 subunit of VLA-2 receptor) 5.13 1.15 ITGA3 3675IPI00290043 Integrin, alpha 3 (antigen CD49C, alpha 3 subunit of VLA-3receptor) 1.90 1.80 JAG1 182 IPI00099650 Jagged 1 (Alagille syndrome)3.10 1.87 MGC17330 113791 IPI00298388 phosphoinositide-3-kinaseinteracting protein 1 2.53 20.35 KIAA1797 54914 IPI00748360 KIAA17971.70 19.36 LAMC2 3918 IPI00015117 laminin, gamma 2 20.02 7.45 LEFTY27044 IPI00010893 left-right determination factor 2 2.23 23.89 LIF 3976IPI00009720 leukemia inhibitory factor (cholinergic differentiationfactor) 2.71 2.22 XYLT1 64131 IPI00183487 hypothetical protein LOC23382412.13 23.61 LTBP1 4052 IPI00784258 latent transforming growth factorbeta binding protein 1 2.86 2.25 LTBP2 4053 IPI00292150 latenttransforming growth factor beta binding protein 2 9.42 4.66 LTBP3 4054IPI00073196 latent 
transforming growth factor beta binding protein 33.87 1.55 LTBP4 8425 IPI00873371 latent transforming growth factor betabinding protein 4 3.01 2.14 PIK3IP1 113791 IPI00296388 HGFL gene///HGFLgene 2.53 20.35 MMP1 4312 IPI00008561 matrix metailoproteinase 1(interstitial-collagenase) 5.81 6.27 MMP10 4319 IPI00013405 matrixmetailoproteinase 10 20.03 25.23 MMP2 4313 IPI00027780 matrixmetailoproteinase 2 10.86 8.36 MMP9 4318 IPI00027509 matrixmetailoproteinase 9 1.34 22.00 MRC2 9902 IPI00005707 mannose receptor, Ctype 2 4.97 1.47 NPC2 10577 IPI00301579 Niemann-Pick disease, type C22.88 3.33 NPTX1 4884 IPI00220562 neuronal pentraxin 1 7.29 7.41 NRG13084 IPI00221375 neuregulin 1 3.27 2.19 PAWR 5074 IPI00001871 PRKC,apoptosis, WT1, regulator 2.34 20.17 PCDH1 5097 IPI00872579protocadherin 1 (cadherin-like 1) 4.48 22.09 PDGFB 5155 IPI00000044platelet-derived growth factor beta polypeptide 4.86 18.85 PDLIM2 64236IPI00007983 PDZ and LIM domain 2 (mystique) 2.74 16.79 PGRMC2 10424IPI00005202 progesterone receptor membrane component 2 1.85 2.20 PLAT5327 IPI00019590 plasminogen activator, tissue 4.41 19.85 PLAUR 5329IPI00010676 plasminogen activator, urokinase receptor 3.07 2.22 PLOD25352 IPI00337495 procollagen-lysine, 2-oxoglutarate 5-dioxygenase 2 2.812.03 PLSCR3 57048 IPI00216127 phospholipid scramblase 3 2.25 3.22PPP1R14B 26472 IPI00398922 protein phosphatase 1, regulatory (inhibitor)subunit 14B 2.11 1.55 HTRA1 5654 IPI00003176 protease, serine, 11 (IGFbinding) 2.63 3.00 PTPRK 5796 IPI00470937 protein tyrosine phosphatase,receptor type, K 6.19 1.64 RSU1 6251 IPI00847168 Ras suppressor protein1 2.00 1.41 SCG2 7857 IPI00009362 secretogranin II (chromogranin C)31.96 7.63 SEMA3C 10512 IPI00019209 sema domain, immunoglobulin domain(Ig), short basic 4.87 3.70 domain, secreted, (semaphorin) 3C SERPIN E15054 IPI00007118 serine (or cysteine) proteinase inhibitor clade E(nexin, 41.73 4.13 plasminogen activator inhibitor type 1), member 1SERPINE2 5270 IPI00009890 Serine (or cysteine) 
proteinase inhibitor,clade E (nexin, 17.90 4.36 plasminogen activator inhibitor type 1),member 2 SPOCK1 6695 IPI00005292 sparc/osteonectin, cwcv and kazal-likedomains proteoglycan (testican) 70.53 5.84 STC 1 6781 IPI00005564Stanniocalcin 1 10.09 3.99 TAGLN 6876 IPI00216138 transgelin 13.35 6.44TAGLN2 8407 IPI00647915 transgelin 2 2.24 1.17 TAX1BP3 30851 IPI00005585Tax1 (human T-cell leukemia virus type I) binding protein 3 1.83 3.54TGFB1 7040 IPI00000075 transforming growth factor, beta 1 2.57 2.22TGFBR1 7046 IPI00005733 Transforming growth factor, beta receptor 1 5.722.72 THBS1 7057 IPI00296099 thrombospondin 1 16.73 4.71 TIMP2 7077IPI00027166 tissue inhibitor of metalloproteinase 2 4.72 2.54 TLL2 7093IPI00465231 tolloid-like 2 2.59 2.15 TNFAIP6 7130 IPI00303341 tumornecrosis factor, alpha-induced protein 6 5.03 7.48 TNFRSF12A 51330IPI00010277 tumor necrosis factor receptor superfamily, member 12A 5.101.83 TP53I3 9540 IPI00384643 tumor protein p53 inducible protein 3 3.352.47 TUBA4B 80086 IPI00017454 tubulin, alpha 4 2.24 2.20 ULBP2 80328IPI00018660 UL16 binding protein 2 5.18 3.51 VEGFA 7422 IPI00012567vascular endothelial growth factor 4.66 2.86

TABLE 2 Clinical Characteristics of Samples Used in This Study.
Data set              Shedden set    Bild set         Raponi set     Okayama set
Sample number         442            111              129            226
Type of cancer        Ad             58 Ad, 53 SCC    SCC            Ad
Age average           64.4           64.8             67.5           59.5
Gender
  Female              219            48               48             121
  Male                224            53               82             105
Stage
  Stage I             276            67               73             168
  Stage II            105            18               34             58
  Stage III           59             21               23             0
Differentiation
  Well                60             NA               15             NA
  Moderate            209            NA               76             NA
  Poor                167            NA               39             NA
Dead (5 year)         168            58               52             32
Alive                 255            53               78             194
Adjuvant therapy
  Yes                 109                             48
  No                  330                             69             204
  Unknown             3              111              12             22
Abbreviations: Ad, adenocarcinomas; SCC, squamous cell cancer. Adjuvant therapy includes chemo- and/or radio-therapy.

TABLE 3 Prediction Results of 20-gene rEASP Signature on Three Test Sets.
                              RSF* test     Cox model**                           Log rank test***
                              error rate    P             HR      95% CI          P
Okayama test set (n = 226)    29.9%         0.0004
  Low risk                                                1                       0.002
  Medium risk                                             2.37    1.03-5.46
  High risk                                               3.75    1.70-8.28
Raponi test set (n = 129)     36.4%         0.007
  Low risk                                                1                       0.02
  Medium risk                                             2.38    1.12-5.10
  High risk                                               2.79    1.32-5.90
Bild test set (n = 111)       35.8%         0.002
  Low risk                                                1                       0.03
  Medium risk                                             1.54    0.78-3.02
  High risk                                               2.34    1.24-4.42
MRI = mortality risk index; Med = Medium. *RSF prediction model built from the 442 training set including 20 genes, age, gender and stage. **MRI as continuous value; likelihood ratio test (LRT) was used in univariate Cox model. ***MRI separated test patients into 3 risk groups (low, medium and high risk, one third in each group).

TABLE 4

                            RSF*             Cox model**  Log rank test***
                            Test error rate  P            HR     95% CI      P
Bild test set (n = 111)     33.6%            0.00008
  Low risk                                                1                  0.003
  Medium risk                                             2.17   1.08-4.34
  High risk                                               3.16   1.60-6.24
Okayama test set (n = 226)  30.0%            0.005
  Low risk                                                1                  0.002
  Medium risk                                             2.60   0.81-8.28
  High risk                                               5.37   1.82-15.88
Raponi test set (n = 129)   36.7%            0.02
  Low risk                                                1                  0.002
  Medium risk                                             0.96   0.44-2.07
  High risk                                               2.61   1.36-5.01

MRI = mortality risk index; Med = Medium.
*RSF prediction model built from the training set (n = 442), including 97 genes, age, gender and stage.
**MRI was treated as a continuous value; the likelihood ratio test (LRT) was used in a univariate Cox model.
***MRI separated test patients into 3 risk groups (low, medium and high risk, one third in each group).
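
The footnotes to Tables 3 and 4 state that the MRI separated test patients into three risk groups with one third of the patients in each. The following Python sketch illustrates one way such a tertile split could be made. It is an illustration under stated assumptions, not the procedure used in the study: the function name assign_risk_groups, the patient identifiers and the MRI values are hypothetical, and the handling of ties and of cohort sizes not divisible by three is an assumption, since the disclosure does not specify it.

```python
from typing import Dict


def assign_risk_groups(mri_by_patient: Dict[str, float]) -> Dict[str, str]:
    """Rank patients by mortality risk index (MRI) and split them into thirds.

    Assumption: patients are ordered from lowest to highest MRI, ties keep
    their input order, and any remainder goes to the lower-risk groups first.
    """
    ranked = sorted(mri_by_patient, key=lambda pid: mri_by_patient[pid])
    n = len(ranked)
    groups: Dict[str, str] = {}
    for rank, patient in enumerate(ranked):
        if rank * 3 < n:          # lowest third of ranked MRI values
            groups[patient] = "low"
        elif rank * 3 < 2 * n:    # middle third
            groups[patient] = "medium"
        else:                     # highest third
            groups[patient] = "high"
    return groups


# Hypothetical MRI values for six patients (not data from the study).
example_mri = {"P1": 0.12, "P2": 0.85, "P3": 0.40,
               "P4": 0.33, "P5": 0.91, "P6": 0.05}
example_groups = assign_risk_groups(example_mri)
print(example_groups)
```

With groups assigned in this way, the hazard ratios and log rank P values of the kind reported in the tables would then be estimated by a separate survival analysis comparing the medium and high risk groups with the low risk group.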

Although a variety of embodiments have been described in connection with the present disclosure, it should be understood that the claimed invention should not be unduly limited to such specific embodiments. Indeed, various modifications and variations of the described compositions and methods of the invention will be apparent to those of ordinary skill in the art and are intended to be within the scope of the following claims.

We claim:
 1. A kit for determining a prognosis in a subject diagnosed with lung cancer, comprising: reagents for detection of altered expression of one or more genes selected from the group consisting of HMGA2, LAMC2, EIF4EBP1, FN1, HSPB1, ULBP2, VEGFA, BMP1, FST, CD59, TNFRSF12A, PDGFB, COL4A3, SERPINE1, SCG2, TGFB1, HMOX1, STC1, XYLT1, and PPP1R14B.
 2. The kit of claim 1, further comprising reagents for detection of altered expression of one or more genes selected from the group consisting of ADAM19, ANGPTL4, AP1S2, ARPC4, BPGM, CD151, CHST11, CHST3, COL1A1, COL4A1, COL4A2, COL7A1, CRIP2, VCAN, CTGF, CXCL12, CYR61, DSC2, ECM1, EFNA1, EPHB2, FHL2, FSTL1, FSTL3, C11orf41, GALNT2, GSN, IGF1, IGF2, IGFBP5, IGFBP7, IL11, INHBA, ITGA2, ITGA3, JAG1, MGC17330, KIAA1797, LEFTY2, LIF, LTBP1, LTBP2, LTBP3, LTBP4, PIK3IP1, MMP1, MMP10, MMP2, MMP9, MRC2, NPC2, NPTX1, NRG1, PAWR, PCDH1, PDLIM2, PGRMC2, PLAT, PLAUR, PLOD2, PLSCR3, HTRA1, PTPRK, RSU1, SEMA3C, SERPINE2, SPOCK1, TAGLN, TAGLN2, TAX1BP3, TGFBR1, THBS1, TIMP2, TLL2, TNFAIP6, TP53I3, or TUBA4B.
 3. The kit of claim 1, wherein said kit comprises reagents for detection of 5 or more of said genes.
 4. The kit of claim 1, wherein said kit comprises reagents for detection of 10 or more of said genes.
 5. The kit of claim 1, wherein said kit comprises reagents for detection of 15 or more of said genes.
 6. The kit of claim 1, wherein said kit comprises reagents for detection of all of said genes.
 7. The kit of claim 1, wherein said reagents are affixed to a solid support.
 8. The kit of claim 7, wherein said solid support is an array.
 9. The kit of claim 1, wherein said reagents are selected from the group consisting of nucleic acid probes that bind to a nucleic acid encoding said gene, a pair of amplification primers that bind to a nucleic acid encoding said gene, a sequencing primer that binds to a nucleic acid encoding said gene, and an antibody that binds to a polypeptide encoded by said gene.
 10. The kit of claim 1, wherein said prognosis is selected from the group consisting of survival, lymph node metastasis, and advancement of said lung cancer.
 11. The kit of claim 1, wherein said lung cancer is lung adenocarcinoma or squamous cell carcinoma.
 12. The kit of claim 1, wherein said lung cancer is early stage lung cancer or advanced lung cancer.
 13. A composition comprising one or more reaction mixtures, each reaction mixture comprising a complex of a reagent for detection of one or more genes selected from the group consisting of HMGA2, LAMC2, EIF4EBP1, FN1, HSPB1, ULBP2, VEGFA, BMP1, FST, CD59, TNFRSF12A, PDGFB, COL4A3, SERPINE1, SCG2, TGFB1, HMOX1, STC1, XYLT1, and PPP1R14B bound to said gene.
 14. The composition of claim 13, further comprising one or more reaction mixtures, each reaction mixture comprising a complex of a reagent for detection of one or more genes selected from the group consisting of ADAM19, ANGPTL4, AP1S2, ARPC4, BPGM, CD151, CHST11, CHST3, COL1A1, COL4A1, COL4A2, COL7A1, CRIP2, VCAN, CTGF, CXCL12, CYR61, DSC2, ECM1, EFNA1, EPHB2, FHL2, FSTL1, FSTL3, C11orf41, GALNT2, GSN, IGF1, IGF2, IGFBP5, IGFBP7, IL11, INHBA, ITGA2, ITGA3, JAG1, MGC17330, KIAA1797, LEFTY2, LIF, LTBP1, LTBP2, LTBP3, LTBP4, PIK3IP1, MMP1, MMP10, MMP2, MMP9, MRC2, NPC2, NPTX1, NRG1, PAWR, PCDH1, PDLIM2, PGRMC2, PLAT, PLAUR, PLOD2, PLSCR3, HTRA1, PTPRK, RSU1, SEMA3C, SERPINE2, SPOCK1, TAGLN, TAGLN2, TAX1BP3, TGFBR1, THBS1, TIMP2, TLL2, TNFAIP6, TP53I3, or TUBA4B bound to said gene.
 15. The composition of claim 13, wherein said reagents are selected from the group consisting of nucleic acid probes that bind to a nucleic acid encoding said gene, a pair of amplification primers that bind to a nucleic acid encoding said gene, a sequencing primer that binds to a nucleic acid encoding said gene, and an antibody that binds to a polypeptide encoded by said gene.
 16. A method for determining a prognosis of a subject diagnosed with lung cancer, comprising: contacting a sample from a subject diagnosed with lung cancer with reagents for detection of altered expression of one or more genes selected from the group consisting of HMGA2, LAMC2, EIF4EBP1, FN1, HSPB1, ULBP2, VEGFA, BMP1, FST, CD59, TNFRSF12A, PDGFB, COL4A3, SERPINE1, SCG2, TGFB1, HMOX1, STC1, XYLT1, and PPP1R14B.
 17. The method of claim 16, further comprising contacting said sample with reagents for detection of altered expression of one or more genes selected from the group consisting of ADAM19, ANGPTL4, AP1S2, ARPC4, BPGM, CD151, CHST11, CHST3, COL1A1, COL4A1, COL4A2, COL7A1, CRIP2, VCAN, CTGF, CXCL12, CYR61, DSC2, ECM1, EFNA1, EPHB2, FHL2, FSTL1, FSTL3, C11orf41, GALNT2, GSN, IGF1, IGF2, IGFBP5, IGFBP7, IL11, INHBA, ITGA2, ITGA3, JAG1, MGC17330, KIAA1797, LEFTY2, LIF, LTBP1, LTBP2, LTBP3, LTBP4, PIK3IP1, MMP1, MMP10, MMP2, MMP9, MRC2, NPC2, NPTX1, NRG1, PAWR, PCDH1, PDLIM2, PGRMC2, PLAT, PLAUR, PLOD2, PLSCR3, HTRA1, PTPRK, RSU1, SEMA3C, SERPINE2, SPOCK1, TAGLN, TAGLN2, TAX1BP3, TGFBR1, THBS1, TIMP2, TLL2, TNFAIP6, TP53I3, or TUBA4B.
 18. The method of claim 16, wherein said prognosis is selected from the group consisting of survival, lymph node metastasis, and advancement of said lung cancer.
 19. The method of claim 16, wherein said lung cancer is lung adenocarcinoma or squamous cell carcinoma.
 20. The method of claim 16, further comprising the step of determining a treatment course of action based on said characterizing.